Materials and Methods relating to Breast Cancer Classification

Field of the Invention 

The present invention concerns materials and methods 
relating to the classification of breast cancers. 
Particularly, the present invention concerns the 
determination of the prognosis of breast cancers. 

Background of the Invention 

There has been an intense interest in the use of gene 
expression data for biological classification, particularly 
in the fields of oncology and medicine. One exciting aspect 
of this approach has been its ability to define clinically 
relevant subtypes of cancer that have previously eluded more 
traditional light-microscopy approaches. Despite this
potential, a number of issues have to be resolved before the
use of gene expression data for clinical diagnosis can
become a reality. For example, algorithms need to be
implemented that, besides delivering the correct
classification, can also accurately determine the confidence
of the prediction. This is particularly important if the
classification affects the subsequent course of treatment:
if furnished with such information, the treating physician
can then weigh the confidence of prediction with the 
potential morbidity of a specific intervention to make an 
informed clinical choice. 

The Nottingham Prognostic Index (NPI) is a classification 
system based on tumour size, histological grade, and lymph 
node status, which is widely used in Europe and the UK for 
assigning prognoses to breast tumours (1-5). Despite its
utility, it is acknowledged that the use of conventional
histopathological parameters such as tumour grade and
cellular morphology is also associated with certain
limitations. Many of these variables (e.g. grade) are
subject to significant inter-observer variability even after
standardization attempts (6). The NPI scale extends between
values of 2 and 8. Appropriate cut-off points are often
difficult to define when the parameter being measured is
scored over a continuous range of values (7), such as the
NPI.

The index therefore depends on a series of subjective 
criteria, which can result in discrepancies between 
observers in the assigned prognosis. 

The NPI is a scale of values; a patient that has a lower NPI 
value than another patient typically has a better prognosis 
than that of the other patient. Prognosis is typically 
defined using factors such as the chance of survival over a
particular timescale and/or chance of distant metastasis
within a particular timescale (although not necessarily the
same timescale as for survival). Generally speaking
therefore, a patient's outlook worsens with increasing NPI
value.

Determining a patient's prognosis is an important factor in 
determining the type and extent of treatment for the
patient. As a future treatment program may be associated
with prognosis, the accuracy of the assigned prognosis is
therefore critical. For example, van't Veer et al. (10) have
identified a 70-gene "prognosis expression signature" (PES)
that predicts the Disease Free Survival (DFS) status of
breast tumours.

Summary of the Invention 

The present inventors studied expression data for a set of 
breast tumours but, initially, were unable to identify a set 
of genes whose expression is correlated to the NPI. The
inventors hypothesized that there may be significant
differences in gene expression between subtypes
("inter-subtype differences"), which potentially obscure more
subtle patterns of variation within subtypes ("intra-subtype
differences"). It has been proposed that a significant
proportion of the intrinsic gene expression variation in
breast cancer can be attributed to different tumours
belonging to distinct 'molecular subtypes', such as ER+ and
ER- (where ER is 'Estrogen Receptor') (8-9, 14).

The dataset was segregated into respective molecular 
subcategories (ER+, ER-, ERBB2+) using unsupervised
clustering techniques. Each molecular subtype was treated as 
an independent data set. Tumours within each subtype were 
independently analysed to define a set of genes whose level 
of expression correlates to the NPI. 

Clinicians generally divide the NPI scale into three 
categories: 'good' prognosis, 'moderate' prognosis and
'poor' prognosis. The values that define the category
boundaries vary depending on the clinician. An example of a
typical set of boundaries is: good prognosis NPI < 3.4;
moderate prognosis 3.4 <= NPI <= 5.4; and poor prognosis NPI
> 5.4. Those skilled in the art will realise that these
boundaries may be varied.

The present inventors have identified a set of 62 genes that
are differentially expressed in tumours of differing
prognoses, e.g. differentially expressed in tumours with a
high NPI (and therefore poor prognosis) compared to tumours
with a low NPI (and therefore good prognosis).

Although the set of genes was identified after classifying 
samples according to their NPI, it has also been found that 
classifying tumour samples using the expression levels of
these genes correlates with other measures of prognosis (e.g.
disease-free survival).

Accordingly, the expression levels of these genes in a tumour 
sample have significant medical implications for the 
prognosis and treatment of the patient from whom the sample 
was derived. In particular, they may be used to classify a 
tumour sample, as an indicator of the prognosis of the 
patient. 

Values ranging from 3.8 to 4.6 on the NPI scale were used as 
cut-off points between "good" and "bad" prognosis and the 
same set of 62 differentially expressed genes was identified
using each cut-off value. 

This indicates that, although NPI covers a continuous 
spectrum of values from 2 to 8, the expression levels of
genes from the set of 62 genes are capable of classifying
tumour samples into discrete categories. Thus, samples
exhibiting continuous NPI values based upon histopathological
parameters may be separable into discrete categories at the
molecular level.

Moreover, comparison of prognoses assigned to breast tumour
patients using (i) the methods of the invention and (ii)
clinical techniques (usually histopathological techniques),
indicates that, based on patient data such as DFS and
Kaplan-Meier survival curves, the methods of the invention may
provide a more accurate prognosis than histopathological
techniques.

The 62 genes are identified in Table S6. The following
description will make use of the term "expression profile".
This refers to the expression levels for a set of genes in a
sample. Unless the context requires otherwise, the set of
genes will include some or all of the 62 genes identified in 
Table S6. 

The 62 genes identified herein overlap by one gene only 
(DC13 or Hs.6879) with the genes identified in the PES of
van't Veer et al. (10). The PES is the first 70 genes (the
genes that exhibit the most significant difference in
expression between groups showing different disease free
survival rates) of an extended geneset of 231 Rosetta genes
(10). There are 8 genes common to the 62 genes of Table S6
and the 231 Rosetta genes, which eight genes are listed in
Table S13.

Two genes in Table S6 are highly expressed in low NPI
tumours (the "Negative genes"), whilst 60 of the genes are
highly expressed in high NPI tumours (the "Positive genes").

Accordingly, at its most general, the present invention 
provides a method for deriving a set of differentially
expressed genes. The invention also provides methods and
assays for the classification and/or assignment of a 
prognosis to a breast tumour sample. The invention 
identifies a set of genes and provides the use of the 
expression levels of some or all of those genes in a breast 
tumour sample in assigning a prognosis to the patient from 
whom the sample was derived. 

In a first aspect, the present invention provides a method 
for determining the prognosis of a patient with breast 
cancer, the method comprising assigning a prognosis to the
patient based on the expression levels in a breast tumour of
said patient of a set of genes (hereafter referred to as the
"prognostic set"), wherein the prognostic set includes a
plurality of genes from Table S6.

The invention further provides the use of the prognostic set 
in determining the prognosis of a patient with breast cancer. 
Preferably, the invention provides the use of an expression
profile in determining the prognosis of a patient with a
breast tumour, the expression profile representing the 
expression levels in the tumour of the genes of the 
prognostic set. 

"Prognosis" is intended in its most general sense, and may be 
quantitative or qualitative. It may be expressed in general 
terms, such as a "good" or "bad" prognosis, and/or in terms 
of likely clinical outcomes, such as duration of disease free 
survival (DFS), likelihood of survival for a defined period
of time, and/or probability of distant metastasis within a
defined period of time. Quantitative measures of prognosis
will generally be probabilistic. Additionally or
alternatively, and especially for communicating the prognosis
to or between medical practitioners, the prognosis may be
expressed in terms of another indicator of prognosis, such as
the NPI scale.

In general, a patient with a 'good prognosis' tumour would
probably be treated with a conventional treatment regimen. A
patient with a 'poor prognosis' tumour might be treated with
an alternative or more aggressive regimen. The 'poor
prognosis' patient would usually not have to wait for the
conventional treatment regimen to fail before moving onto the
more aggressive one. Furthermore, having an understanding of
the likely clinical course of the disease allows a patient to
prepare a realistic plan for the future, which is an important
social aspect of cancer treatment. 

For the avoidance of doubt, the term "determining" need not 
imply absolute certainty in prognosis. Rather, the 
expression levels of the prognostic set in a tumour will 
generally be indicative of the likely prognosis of the 
patient.

The expression levels will generally be represented 
numerically. The expression profile therefore will generally 
include a set of numbers, each number representing the 
expression level of a gene of the prognostic set. 

A method in accordance with the first aspect of the
invention may comprise the steps of:

providing an expression profile that represents the
expression levels in the tumour of the genes of the
prognostic set, and

assigning a prognosis to the patient based on the
expression profile.

The providing step may include extracting information on the 
expression levels of the genes of the prognostic set from a 
pre-existing data set, which may also include other
expression levels (e.g. data representing expression levels
of other genes in the tumour). Alternatively, it may include
determining the expression levels experimentally. 

The determining step may include the steps of: 

(a) obtaining a breast tumour sample from the patient; 

(b) measuring the expression levels in the sample of the
genes of the prognostic set. 

Measurement of the expression level of a gene, and in 
particular its representation in the expression profile, may 
be in absolute terms, or relative to some other factor such 
as, but not limited to, the expression of another gene, or a
mean, median or mode of the expression level of a group of
genes (preferably genes outside the prognostic set, but
possibly including genes of the prognostic set) in the
sample, or across a group of samples. For example, expression
of a gene may be measured or represented as a multiple or
fraction of the average expression of a plurality of genes
in the sample. Preferably, the expression is represented in
the expression profile as positive or negative to indicate
an increase or decrease in expression relative to the
average value.
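By way of illustration only, the signed relative representation
described above may be sketched as follows. The sketch assumes
that the measured levels are already on a logarithmic scale, so
that a difference from the average corresponds to a fold change;
the gene names, the function name and the choice of reference
genes are hypothetical and are not taken from Table S6.

```python
from statistics import mean

def relative_profile(levels, reference_genes=None):
    """Express each gene as a signed difference from a baseline.

    levels: dict mapping gene name -> expression level (assumed log scale).
    reference_genes: genes whose mean defines the baseline; defaults to
    all genes in `levels` (an assumption made for this sketch).
    """
    reference = reference_genes if reference_genes is not None else list(levels)
    baseline = mean(levels[gene] for gene in reference)
    # Positive values indicate expression above the baseline, negative below.
    return {gene: value - baseline for gene, value in levels.items()}

sample = {"geneA": 8.0, "geneB": 5.0, "geneC": 2.0}
profile = relative_profile(sample)
```

Here the baseline is 5.0, so geneA is represented as positive,
geneC as negative and geneB as zero.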

In a non-preferred embodiment, expression profile
information in the form of a set of numerical values is
converted into a ranked list of genes of the prognostic set,
wherein the genes are ranked in order of expression level,
after which the rank order of the individual genes is used
as a parameter in the analysis (instead of the expression
value of the gene).
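The rank conversion described above may be sketched, purely as
an example, as follows. The sketch assumes that rank 1 denotes
the most highly expressed gene and that ties are broken by input
order; neither convention is specified in this description, and
the gene names are hypothetical.

```python
def rank_profile(levels):
    """Convert expression levels into rank order (1 = highest level).

    levels: dict mapping gene name -> numerical expression level.
    Ties keep their input order because Python's sort is stable.
    """
    ordered = sorted(levels, key=levels.get, reverse=True)
    return {gene: rank for rank, gene in enumerate(ordered, start=1)}

ranks = rank_profile({"geneA": 2.5, "geneB": 7.1, "geneC": 0.3})
```

The resulting ranks, rather than the expression values, would
then be passed to the subsequent analysis.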

Preferably, step (b) comprises contacting expression
products obtained from the sample with a plurality of binding
members capable of binding to expression products that are
indicative of the expression of genes of the prognostic set,
wherein such binding may be measured. 

Generally, the binding members are capable of detecting not
only the presence of an expression product but also its
relative abundance (i.e. the amount of product available).
The expression profile can be determined using binding
members capable of binding to the expression products of the
prognostic set, e.g. mRNA, corresponding cDNA or cRNA or
expressed polypeptide. By labelling either the expression
product or the binding member it is possible to identify the
relative quantities or proportions of the expression
products and determine the expression profile of the
prognostic set. The binding members may be complementary
nucleic acid sequences or specific antibodies. 

The step of assigning a prognosis may be carried out by 
comparing the expression profile under test with other, 
previously obtained, profiles that are associated with known 
prognoses and/or with a previously determined "standard" 
profile (or profiles) which is (or are) characteristic of a 
particular prognosis (or prognoses). A standard profile for a
particular prognosis may be generated from expression
profiles from a plurality of tumours of that prognosis.

The comparison will generally be performed by, or with the 
aid of, a computer. 

Preferably the expression profile is compared with known or 
standard profiles (preferably standard profiles) of differing 
known prognoses. The prognosis to be assigned to the patient 
is that of the known or standard profile which the expression 
profile under test most closely resembles. 
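A minimal sketch of such a comparison is given below. It assumes
that a standard profile is the gene-wise mean of several profiles
of the same prognosis, and that "most closely resembles" is
measured by Pearson's correlation coefficient; both are
assumptions made for illustration, and the numerical values are
invented.

```python
from math import sqrt

def pearson(x, y):
    """Pearson's correlation coefficient between two equal-length vectors."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)

def standard_profile(profiles):
    """Gene-wise mean of several profiles sharing the same prognosis."""
    return [sum(values) / len(values) for values in zip(*profiles)]

def assign_prognosis(test_profile, standards):
    """Return the label of the standard profile most similar to the test."""
    return max(standards, key=lambda label: pearson(test_profile, standards[label]))

standards = {
    "good": standard_profile([[-1.0, -0.8, 1.2], [-1.2, -0.6, 0.8]]),
    "bad": standard_profile([[1.1, 0.9, -1.0], [0.9, 0.7, -1.2]]),
}
label = assign_prognosis([0.8, 1.0, -0.9], standards)
```

In this invented example the profile under test follows the
pattern of the "bad" standard profile and is labelled accordingly.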

Preferably the comparison is with known or standard profiles
(preferably standard profiles) that are categorised into two 
different prognoses, e.g. "good" and "bad", or high and low 
NPI (preferably with a cut-off between 3.8 and 4.6). The 
known or standard profiles will have been generated from 
samples of known prognosis, which may be determined in any 
convenient way - either by actual clinical outcome for the 
patient following the removal of the sample, or by other 
prognostic techniques, e.g. histopathological techniques, 
e.g. using the NPI scale. 

The comparison may involve an assessment of the confidence 
level attributable to the prognosis, based on statistical 
techniques. The standard profiles are usually specific to the 
particular materials and methods (e.g. microarray) from which
they were derived. If new materials and/or methods (e.g. a
new type of microarray) are adopted, the standard profiles of
known prognoses are preferably obtained again using the
prognostic set.

The method according to the first aspect of the invention may
include classifying the sample of breast tumour as being of
either high NPI or low NPI, or as either of good or bad
prognosis, for example.

As mentioned previously, the step of assigning a prognosis 
may be carried out by comparing the expression profile from 
the breast tumour sample under test with previously obtained 
profiles and/or a previously determined "standard" profile
which is characteristic of a particular prognosis, for
example a 'good' and/or a 'poor' prognosis and/or at least
one NPI value and/or at least one range of NPI values. The
previously obtained profiles may be stored as a database of
profiles. 

Preferably the database includes gene expression profiles 
characteristic of a particular prognosis. The gene expression 
profiles are preferably produced from expression levels of 
the same prognostic set (a subset of the genes of Table S6) 
as the prognostic set of the first aspect of the invention, 
or a prognostic set (potentially a different subset from 
above) sufficiently overlapping the prognostic set of the 
first aspect so as to provide a statistically significant 
base for comparison of the expression levels. The computer
may be programmed to report the statistical similarity
between the profile under test and the standard profile(s) so
that a prognosis may be assigned. 

Advantageously, the use of a gene expression profile to
assign a prognosis may reduce or may even eliminate the
subjective nature of the clinical procedures used to assign a
prognosis to a tumour sample. As the method requires
assessment of expression products at the molecular level,
preferably quantitatively, the method provides a more
objective, and therefore potentially more reliable, way to
assign a prognosis. The prognostic set is, as mentioned
earlier, capable of separating breast tumour samples into
discrete categories, and therefore reducing, or even
eliminating, the subjective analysis of clinical prognostic
assignment. Furthermore, a confidence can be assigned to the
prediction, so that an informed choice regarding treatment of
the patient can be made, depending on the "strength" of the
prognosis.

The expression profile of the prognostic set may differ
slightly between independent samples of similar prognosis.
However, the inventors have realised that the expression
levels of the particular genes that make up the prognostic
set, when used in combination, provide a pattern of expression
(expression profile) in a tumour sample, which pattern is 
characteristic of the tumour's prognosis. 

The inventors have found that the prognostic set is capable 
of resolving tumour samples into high NPI and low NPI 
classes. By high NPI it is meant an NPI of preferably at 
least 3.4, preferably at least 3.5, more preferably at least 
3.6, more preferably at least 3.7, more preferably at least 
3.8, more preferably at least 3.9 and most preferably at 
least 4.0. High NPI may be at least 4.1, at least 4.2, at 
least 4.3, at least 4.4, at least 4.5, or at least 4.6. The 
preferred cut-off value between high and low NPI is between
3.8 and 4.6.

Historically, the 'good', 'moderate' and 'bad'/'poor'
categories of NPI were determined using large clinical
studies in which patients belonging to these different groups
exhibited statistically significant differences in overall
survival. For example, patients with good prognosis may have
a ten-year survival rate of about 83%, patients with
'moderate' prognosis may have a ten-year survival rate of
about 52%, and patients with 'poor' or 'bad' prognosis may
have a ten-year survival rate of about 13% (4).

In particular, the prognostic set seems to be correlated 
most strongly to tumour prognosis (as reflected by NPI) in
Estrogen Receptor positive tumours (ER+).

The classification of breast tumours into Estrogen Receptor 
positive (ER+) and negative (ER-) subtypes is an important
distinction in the treatment of breast cancer. ER- tumours
are in general more clinically aggressive than their ER+
counterparts, and ER+ tumours are routinely treated using
anti-hormonal therapies such as tamoxifen (21). Breast
tumours may be classified as ER+ or ER- using histological
techniques (e.g. with antibodies specific for the receptor)
or using gene expression techniques. Presently, a tumour's
ER status is routinely determined by immunohistochemistry 
(IHC) or immunoblotting using an antibody to ER. 

The first aspect of the invention preferably includes a step 
of determining the ER status of the tumour sample. The ER 
status may be determined using gene expression analysis, or 
by using histopathological techniques. Preferably, the first
aspect of the invention further includes, as an initial step,
determining the ER status of the tumour sample, and
proceeding only if the status is ER+.

Preferably the ER status of the tumour sample is determined 
using gene expression profiling as described in our
co-pending application PCT/GB03/000755. Gene expression
profiling is capable of classifying breast tumours as ER+ or
ER- with high confidence. However, there is also a third
category of tumours that could not be classified as ER+ or
ER- with significant statistical certainty ('low confidence'
tumours). Upregulation of ERBB2 is frequently associated
with low confidence tumours. Preferably, only ER+ tumours 
identified with high confidence (preferably classified as ER+ 
with a prediction strength of magnitude greater than 0.4 as 
determined using the methods of PCT/GB03/000755) are assessed 
using the methods according to the first aspect of the 
invention. 

The step of assigning a prognosis to the breast tumour sample 
may comprise the use of statistical and/or probabilistic 
techniques, such as Weighted Voting (WV) (13), a supervised 
learning technique. In WV, binary classifications may be 
performed. That is, the technique may be used to assign a
sample to one of two classes. The expression level of each
gene in the prognostic set of the breast tumour sample is
compared to the mean average level of expression of that gene
across the different classes. The mean average may, for
example, be calculated from expression profiles that have an
assigned prognosis, e.g. a database of expression profiles of
'known' prognosis.

The difference between the expression level and the mean
average gene expression across the classes is weighted and
corresponds to a "vote" for that gene for a particular class
and an equal, but negative, vote for that gene against the
other class. For a particular tumour, the votes (positive and 
negative) for all the genes are summed together for each 
class to create totals for each class. The tumour is assigned 
to the class having the highest (positive) total. The margin 
of victory of the winning class can then be expressed as 
prediction strength. 

The difference in expression level is weighted using a 
formula that includes mean and standard deviations of 
expression levels of the genes in each of the two classes.
Generally, the mean and standard deviations for each class
are calculated from expression profiles that have, or
represent, a particular prognosis e.g. high NPI and low NPI.
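The voting scheme may be sketched as follows. The weighting
formula is not given in this description; the sketch assumes the
signal-to-noise weight commonly used with weighted voting, namely
(mean_high - mean_low) / (sd_high + sd_low), with the decision
boundary at the midpoint of the two class means, and expresses
prediction strength as the margin of victory divided by the total
votes cast. Function names, class labels and data are
illustrative only.

```python
from statistics import mean, stdev

def train_weighted_voting(high_profiles, low_profiles):
    """Derive a (weight, boundary) pair per gene from training profiles.

    Each argument is a list of profiles; a profile is a list of expression
    levels in a fixed gene order. At least two profiles per class are needed,
    and a gene with zero spread in both classes would divide by zero.
    """
    params = []
    for high_vals, low_vals in zip(zip(*high_profiles), zip(*low_profiles)):
        mu_high, mu_low = mean(high_vals), mean(low_vals)
        weight = (mu_high - mu_low) / (stdev(high_vals) + stdev(low_vals))
        boundary = (mu_high + mu_low) / 2
        params.append((weight, boundary))
    return params

def classify(profile, params):
    """Sum the per-gene votes; return (class label, prediction strength)."""
    votes_high = votes_low = 0.0
    for level, (weight, boundary) in zip(profile, params):
        vote = weight * (level - boundary)
        if vote > 0:
            votes_high += vote   # vote in favour of the high NPI class
        else:
            votes_low -= vote    # vote in favour of the low NPI class
    total = votes_high + votes_low
    strength = abs(votes_high - votes_low) / total if total else 0.0
    label = "high NPI" if votes_high >= votes_low else "low NPI"
    return label, strength

high = [[2.0, 1.0], [3.0, 2.0], [2.5, 1.5]]
low = [[0.0, 3.0], [1.0, 4.0], [0.5, 3.5]]
params = train_weighted_voting(high, low)
label, strength = classify([1.0, 1.5], params)
```

In this invented two-gene example the genes disagree, so the
sample is assigned to the high NPI class with a prediction
strength of about 0.33 rather than 1.0.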

Additionally, or alternatively, the step of assigning a 
prognosis may comprise the use of hierarchical clustering, 
particularly if expression levels in the tumour sample have 
been determined using different materials and/or methods from 
those used to determine the expression profiles with 'known'
prognoses, or standard profile(s) to which the sample
expression profile is compared.

The assigned prognosis may be validated using an established 
leave-one-out cross validation (LOOCV) assay (see examples).
Step (c) may be performed using a computer. 
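As a hedged illustration of leave-one-out cross validation, the
sketch below holds out each profile of known prognosis in turn,
classifies it using the remaining profiles, and reports the
fraction classified correctly. The nearest-class-mean classifier
is a stand-in chosen for brevity and is not the assay of the
examples; all names and data are hypothetical.

```python
def nearest_mean_label(training, test_profile):
    """Assign the label whose class mean is closest (squared Euclidean)."""
    by_label = {}
    for profile, label in training:
        by_label.setdefault(label, []).append(profile)

    def distance_to_mean(label):
        centroid = [sum(v) / len(v) for v in zip(*by_label[label])]
        return sum((x - c) ** 2 for x, c in zip(test_profile, centroid))

    return min(by_label, key=distance_to_mean)

def loocv_accuracy(samples, classify=nearest_mean_label):
    """samples: list of (profile, known label) pairs."""
    correct = 0
    for i, (profile, label) in enumerate(samples):
        training = samples[:i] + samples[i + 1:]   # leave sample i out
        if classify(training, profile) == label:
            correct += 1
    return correct / len(samples)

samples = [
    ([2.0, 1.0], "high"), ([3.0, 2.0], "high"), ([2.5, 1.5], "high"),
    ([0.0, 3.0], "low"), ([1.0, 4.0], "low"), ([0.5, 3.5], "low"),
]
accuracy = loocv_accuracy(samples)
```

Any other classifier taking the same two arguments could be
passed as `classify` in place of the stand-in.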

In Hierarchical Clustering, each expression profile can be 
represented as a vector that consists of n genes where (g1,
g2, ..., gn) represent the expression levels of the genes. Each
vector is then compared with the vector for every other
profile in the analysis, and the two vectors with the highest
correlation to one another are paired together until as many 
profiles as possible in the analysis have been paired up. 

There are many ways known in the art to calculate the 
correlation, such as the Pearson's correlation coefficient
(22). In the next step, a composite vector is then derived
from each pair (in average-linkage clustering this is usually
the average of both profiles), and then the process of
pairing is repeated. This continues until all vectors have 
been paired together, to assemble a "tree" representing all 
the profiles. The process is 'hierarchical' as one starts 
from the bottom (individual profiles) and builds up. In the 
present invention, individual profiles build up to preferably 
two composite vectors, each vector representing a class (i.e. 
good or bad prognosis). For a new sample of unknown class,
the sample is clustered with the standard profiles/samples.
The class of the 'unknown' sample will be determined based on
which cluster/vector it belongs to at the end of the 
iterative rounds of pairing. 
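The pairing procedure may be sketched as follows. This is a
simplified agglomerative variant that merges one pair per round
(the most highly correlated pair), replaces it with the average
of the two vectors, and stops when two composite vectors remain.
Pearson's correlation is used as the similarity measure, and the
profile names and values are invented for illustration.

```python
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Pearson's correlation coefficient between two equal-length vectors."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)

def cluster_into_two(profiles):
    """profiles: dict name -> vector. Returns two sets of profile names."""
    clusters = [({name}, list(vector)) for name, vector in profiles.items()]
    while len(clusters) > 2:
        # Find the pair of current vectors with the highest correlation.
        i, j = max(combinations(range(len(clusters)), 2),
                   key=lambda p: pearson(clusters[p[0]][1], clusters[p[1]][1]))
        members = clusters[i][0] | clusters[j][0]
        composite = [(a + b) / 2 for a, b in zip(clusters[i][1], clusters[j][1])]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append((members, composite))
    return [members for members, _ in clusters]

profiles = {
    "good1": [-1.0, -0.8, 1.2, 0.9],
    "good2": [-1.2, -0.6, 0.8, 1.1],
    "bad1": [1.1, 0.9, -1.0, -0.7],
    "bad2": [0.9, 0.7, -1.2, -1.0],
    "unknown": [1.0, 0.6, -0.8, -0.9],
}
clusters = cluster_into_two(profiles)
unknown_cluster = next(c for c in clusters if "unknown" in c)
```

In this example the unknown sample ends in the same cluster as
the two "bad" standard profiles, and would be classed with them.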

By expression profiles with 'known' or assigned prognosis /
prognoses, it is meant an expression profile to which a
prognosis has been assigned or derived. The prognosis may
have been: calculated from gene expression data; derived from
clinical techniques performed on the source sample (e.g.
histopathological techniques); or assigned retrospectively
based on the actual disease progression / outcome in the
patient from which the expression profile was derived. The
third option is most preferable, as an accurate prognosis
(for the point in time at which the sample was obtained) can
be assigned, based on the subsequent outcome for the patient,
from the patient's medical records. In such retrospective
assignment, the use of hindsight provides accuracy.

The methods of the invention may be used to assess the 
efficacy of treatment of a patient with breast cancer. The 
prognosis of the patient may be assigned before, or at an
early stage of, treatment and compared to the prognosis
assigned to the patient after treatment (or at a late stage
of treatment). The prognosis before and/or after treatment
is preferably assigned using a method according to the
invention. If the treatment comprises stages, the expression
profile may be determined after each stage to plot the
progress of the treatment. An improved prognosis after
treatment indicates a successful, or at least partially 
successful, treatment. The treatment may be chemotherapy. 

The methods of the invention may include comparing the 
expression levels of the prognostic set in the breast tumour 
sample before and after treatment to detect a change in the 
expression profile indicative of an improved prognosis or 
worsened prognosis.

The method may include detecting downregulation of genes in 
the prognostic set that are indicated in Table S6 to be 
'upregulated' and/or upregulation of genes in the prognostic
set that are indicated in Table S6 to be 'downregulated'. The
said genes may be downregulated/upregulated compared to
standard values (e.g. the average expression level across a
range of samples of differing prognosis), and/or compared to
previous values, for example a standard profile indicative or
characteristic of a 'poor' prognosis. The downregulation of
the 'upregulated' genes and/or upregulation of the
'downregulated' genes is indicative of a good or moderate 
prognosis. The extent of the change in regulation may 
indicate the efficacy of the treatment. 

The inventors have found that a change in expression profile 
towards that of a good prognosis tumour is indicative of 
successful treatment. Tumours that exhibit such a change in
expression profile have the best prognosis (e.g. the best
survival rates, the best disease free survival rates). The
expression profile of the tumour at pre- and post-treatment
stages may be compared to standard profiles of known
prognosis.

The method may therefore comprise assigning the expression 
profile of a breast tumour to either good or bad prognosis
class (or high or low NPI class), and assigning a second
expression profile, determined from said tumour at a later
stage of treatment, to either good or bad prognosis class (or
high or low NPI class), and detecting a change in class,
wherein a change from bad prognosis to good prognosis (or
high NPI to low NPI) is indicative of an effective treatment.
Additionally, or alternatively, a change in the statistical 
confidence level of assignment of good or bad prognosis class 
(or high or low NPI class) may indicate the efficacy of 
treatment. A decrease in the confidence of assignment of a 
class indicative of poor prognosis may suggest a successful, 
or at least partially successful, treatment. 

The methods of assessing the efficacy of treatment may 
include the step of determining the ER status of the tumour.
However, the said methods of assessing efficacy are effective
for assessing treatment efficacy of ER+, ER- and ERBB2+
tumours i.e. irrespective of the ER status of the tumour.

The expression profile represents the expression levels of a
group of genes in the tumour. The genes of each expression
profile need not be identical but there should be sufficient
overlap between the genes of each expression profile to
allow comparison and grouping of the expression profiles.

The binding member may be labelled for detection purposes 
using standard procedures known in the art. Alternatively, 
the expression products may be labelled following isolation 
from the sample under test. A preferred means of detection 
is using a fluorescent label which can be detected by a light 
meter. Alternative means of detection include electrical 
signalling. For example, the Motorola (Pasadena, California) 
eSensor system has two probes, a "capture probe" which is
freely floating, and a "signalling probe" which is attached
to a solid surface which doubles as an electrode surface.
Both probes function as binding members to the expression
product. When binding occurs, both probes are brought into
close proximity with each other resulting in the creation of
an electrical signal which can be detected. 

There are, however, a number of newer technologies that have 
recently emerged that utilize 'label-free' techniques for
quantitation, for example those produced by Xagros (Mountain
View, California). The primers and/or the amplified nucleic
acid may be devoid of any label. Quantitation may be
assessed by measuring the change in electrical resistance as
a result of two primers docking onto a target expressed
product, and subsequent extension by polymerase.

As discussed above, the binding members may be 
oligonucleotide primers for use in a PCR (e.g. multi-plexed
PCR) to amplify specifically the number of expressed products 
of the genetic identifiers. The products would then be 
analysed on a gel. However, preferably, the binding member 
is a single nucleic acid probe or antibody fixed to a solid 
support. The expression products may then be passed over the 
solid support, thereby bringing them into contact with the 
binding member. The solid support may be a glass surface, 
e.g. a microscope slide; beads (Lynx); or fibre-optics. In 
the case of beads, each binding member may be fixed to an 
individual bead and they are then contacted with the 
expression products in solution. 

Various methods exist in the art for determining expression 
profiles for particular gene sets and these can be applied to 
the present invention. For example, bead-based approaches
(Lynx) or molecular bar-codes (Surromed) are known
techniques. In these cases, each binding member is attached
to a bead or "bar-code" that is individually readable and
free-floating to ease contact with the expression products.
The binding of the binding members to the expression products
(targets) is achieved in solution, after which the tagged
beads or bar-codes are passed through a device (e.g. a
flow-cytometer) and read.

A further known method of determining expression profiles is
instrumentation developed by Illumina (San Diego,
California), namely, fibre-optics. In this case, each
binding member is attached to a specific "address" at the end
of a fibre-optic cable. Binding of the expression product to 
the binding member may induce a fluorescent change which is 
readable by a device at the other end of the fibre-optic 
cable. 

The present inventors have successfully used a nucleic acid 
microarray comprising a plurality of nucleic acid sequences 
fixed to a solid support. By passing nucleic acid sequences
representing expressed genes, e.g. cDNA, over the microarray,
they were able to create a binding profile characteristic of 
the expression products from a tumour sample with a 
particular prognosis, in particular a tumour sample with a 
good prognosis or a tumour sample with a bad prognosis or a 
tumour sample with a high NPI or a tumour sample with a low
NPI.

In a second aspect, the present invention provides apparatus,
preferably a microarray, for assigning a prognosis to a 
breast tumour sample, which apparatus comprises a solid 
support to which are attached a plurality of binding members, 
each binding member being capable of specifically binding to 
an expression product of a gene of the prognostic set. 
Preferably the binding members attached to the solid support 
are capable of specifically and independently binding to 
expression products of at least 5 genes, more preferably, at 
least 10 genes or at least 15 genes, and most preferably at 
least 20 or 30 genes identified in Table S6. The binding
members attached to the solid support may be capable of
specifically binding to expression products of 20 to 30 genes
identified in Table S6.

In one embodiment, binding members capable of
specifically and independently binding to expression products
of all genes identified in Table S6 are attached to the solid
support. The support may have attached thereto only binding
members that are capable of specifically and independently 
binding to expression products of the genes identified in 
Table S6, or a prognostic set therefrom. 

The apparatus preferably includes binding members capable of 
specifically binding to expression products from the 
prognostic set, or to a plurality of genes thereof, and may 
include binding members capable of specifically binding to 
expression products of only an incomplete subset of the genes 
that are represented on the U133A microarray (though it may
also include binding members for other genes not represented
on the U133A microarray). It is believed that the U133A
microarray represents about 14397 distinct genes.
Accordingly, the apparatus preferably includes binding
members for no more than 14396 of the genes on the U133A
microarray. The apparatus may include binding members capable
of specifically binding to expression products of no more
than 90% of the genes on the U133A microarray. The apparatus
may include binding members capable of specifically binding
to expression products of no more than 80% or 70% or 50% or
40% or 30% or 20% or 10% or 5% of the genes on the U133A
microarray.

Additionally or alternatively, the solid support may house
binding members for no more than 14000, or no more than
10000, or no more than 5000, or no more than 3000, or no more
than 1000, or no more than 500, or no more than 400, or no
more than 300, or no more than 200, or no more than 100, or
no more than 90, or no more than 80, or no more than 70, or
no more than 60, or no more than 50, or no more than 40, or
no more than 30, or no more than 20, or no more than 10, or
no more than 5 different genes.

Preferably the binding members are nucleic acid sequences and 
the apparatus is a nucleic acid microarray. 

The genes of Table S6 are listed with their Unigene accession 
numbers corresponding to Build 160 of the Unigene database. 
The sequence of each gene can therefore be retrieved from the 
Unigene database at the National Institutes of Health (NIH):
(http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=unigene).

Furthermore, for all of the genes, Affymetrix (Santa Clara,
California) (www.affymetrix.com) provide examples of probe
sets, including the sequences of the probes (i.e. binding
members in the form of oligonucleotide sequences) that are 
capable of detecting expression of the gene when used on a 
solid support. The probe details are accessible from the 
U133A section of the Affymetrix website using the Unigene ID 
of the target gene. 

If, in the future, one of the Unigene IDs listed in the
table were to be merged into a new ID, or split into two or
more IDs (e.g. in a new build of the database) or deleted
altogether, the sequence of the gene, as intended by the 
present inventors, is retrievable by accessing Build 160 of 
Unigene. 

Typically, high density nucleic acid sequences, usually cDNA 
or oligonucleotides, are fixed onto very small, discrete
areas or spots of a solid support. The solid support is
often a microscope glass slide or a membrane filter, coated
with a substrate (i.e. a "chip"). The nucleic acid sequences
are delivered (or printed), usually by a robotic system, onto
the coated solid support and then immobilized or fixed to the
support.

In a preferred embodiment, the expression products derived 
from the sample are labelled, typically, using a fluorescent 
label, and then contacted with the immobilized nucleic acid 
sequences. Following hybridization, the fluorescent markers 
are detected using a detector, such as a high resolution 
laser scanner. In an alternative method, the expression 
products could be tagged with a non-fluorescent label, e.g.
biotin. After hybridisation, the microarray could then be
'stained' with a fluorescent dye that binds/bonds to the
first non-fluorescent label (e.g. fluorescently labelled
streptavidin, which binds to biotin). The expression products
may, however, be label-free, as discussed above. 

A binding profile indicating a pattern of gene expression 
(expression pattern or profile) is obtained by analysing the 
signal emitted from each discrete spot with digital imaging
software. The pattern of gene expression of the experimental 
sample may then be compared with that of a standard profile 
(i.e. an expression profile from a tissue sample with, for 
example, a known good or bad prognosis, or a known NPI value 
or known range of NPI values) for differential analysis. 

The standard may be derived from one or more expression 
profiles previously judged to be characteristic of a 
particular prognosis, e.g. 'poor' or 'good' prognosis, and/or
of a particular NPI range such as high and/or low NPI and/or
characteristic of one or more NPI value(s) or one or more
range(s) of values. The standard may be derived from one or
more expression profiles previously judged to be
characteristic of a particular NPI value or range of values
(or other defined value on a prognostic scale). The standard
may include an expression profile characteristic of a normal
sample. These/This standard expression profile(s) may be
retrievably stored on a data carrier as part of a database. 

Most microarrays utilize either one or two fluorophores. For
two-colour arrays, the most commonly used fluorophores are
Cy3 (green channel excitation) and Cy5 (red channel
excitation). The object of the microarray image analysis is
to extract hybridization signals from each expression
product. For one-colour arrays, signals are measured as
absolute intensities for a given target (essentially for
arrays hybridized to a single sample). For two-colour arrays,
signals are measured as ratios of two expression products
(e.g. sample and control (controls are otherwise known as a
'reference')) with different fluorescent labels.
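
As a minimal illustration of the two-colour case, the following Python sketch converts per-spot intensities into log2 ratios of sample to reference. The function name, the `floor` parameter and the intensity values are hypothetical and are not taken from the present disclosure; flooring low intensities is merely one possible way of keeping the ratio defined.

```python
import math


def log2_ratios(sample, reference, floor=1.0):
    """Return per-spot log2(sample / reference) for a two-colour array.

    `sample` and `reference` are equal-length lists of intensities for
    the two channels (e.g. Cy5 and Cy3). Intensities below `floor` are
    raised to `floor` so that the ratio is always defined; this flooring
    rule is an assumption made for illustration only.
    """
    if len(sample) != len(reference):
        raise ValueError("both channels must cover the same spots")
    ratios = []
    for s, r in zip(sample, reference):
        s = max(s, floor)
        r = max(r, floor)
        ratios.append(math.log2(s / r))
    return ratios


# Hypothetical intensities for three spots.
ratios = log2_ratios([800.0, 200.0, 0.0], [200.0, 200.0, 4.0])
```

A positive value indicates a stronger signal in the sample channel than in the reference channel for that spot, and a negative value the reverse.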

The apparatus in accordance with the present invention
preferably comprises a plurality of discrete spots, each spot
containing one or more oligonucleotides and each spot
representing a different binding member for an expression
product of a gene selected from Table S6. In one embodiment,
the microarray will contain spots for each of the genes
provided in Table S6. Each spot will comprise a plurality of
identical oligonucleotides each capable of binding to an
expression product, e.g. mRNA or cDNA, of the gene of Table
S6 it is representing. Each gene is preferably represented by
a plurality of different oligonucleotides, preferably the
Affymetrix U133A set of probes for the gene.

In a third aspect of the present invention, there is provided 
a kit for assigning a prognosis to a patient with breast 
cancer, said kit comprising a plurality of binding members 
capable of specifically binding to expression products of 
genes of the prognostic set, and a detection reagent. The kit 
may include a data analysis tool, preferably in the form of a 
computer program. The data analysis tool preferably comprises 
an algorithm adapted to discriminate between the expression
profiles of tumours with differing prognoses. Preferably the
algorithm is adapted to discriminate between a 'good'
prognosis and a 'poor' prognosis, most preferably between 
high NPI and low NPI tumours. The algorithm is preferably a 
weighted voting algorithm as described above. 
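
The weighted voting algorithm itself is described earlier in this document and is not reproduced here. Purely as a sketch, the following Python code assumes the commonly used signal-to-noise formulation of weighted voting, in which each gene's weight is the difference of the class means divided by the sum of the class standard deviations, and the confidence of a call is the margin of the winning votes over the total votes. The function names, the class labels "A" and "B", the zero-variance guard and the example values are hypothetical, and the formulation may differ in detail from the one described above.

```python
from statistics import mean, stdev


def train_weighted_voting(class_a, class_b):
    """Return one (weight, boundary) pair per gene.

    `class_a` and `class_b` are lists of profiles (one expression level
    per gene) for the two training classes, e.g. low-NPI and high-NPI
    tumours. Each class needs at least two profiles.
    """
    model = []
    for a, b in zip(zip(*class_a), zip(*class_b)):
        spread = stdev(a) + stdev(b)
        # Signal-to-noise weight; a gene with no spread gets no vote.
        weight = (mean(a) - mean(b)) / spread if spread else 0.0
        boundary = (mean(a) + mean(b)) / 2.0
        model.append((weight, boundary))
    return model


def predict(model, profile):
    """Return (label, strength) for one test profile."""
    votes_a = votes_b = 0.0
    for (weight, boundary), level in zip(model, profile):
        vote = weight * (level - boundary)
        if vote > 0:
            votes_a += vote
        else:
            votes_b += -vote
    total = votes_a + votes_b
    strength = abs(votes_a - votes_b) / total if total else 0.0
    label = "A" if votes_a >= votes_b else "B"
    return label, strength


# Hypothetical two-gene training data.
model = train_weighted_voting([[10, 2], [12, 3]], [[4, 8], [6, 9]])
label, strength = predict(model, [11, 8])
```

A strength close to 1 means that nearly all genes voted for the same class; a strength close to 0 means that the votes were almost evenly split, which a user might treat as a low-confidence call.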

In one embodiment, the kit includes apparatus of the second 
aspect of the invention. 

The kit may include expression profiles from breast tumour 
samples with known prognoses (as discussed above), and/or
gene expression profiles characteristic of a particular
prognosis (as discussed above), preferably stored on a data
carrier or other memory device. The profiles may have been 
analysed or grouped statistically, for example, mean average 
expression levels and/or gene weightings calculated. 

Preferably, the one or more binding members (antibody binding 
domains or nucleic acid sequences e.g. oligonucleotides) in 
the kit are fixed to one or more solid supports, e.g. a single
support for microarray or fibre-optic assays, or multiple
supports such as beads. The detection means is preferably a
label (radioactive or dye, e.g. fluorescent) for labelling
the expression products of the sample under test. The kit
may also comprise reagents for detecting and analysing the
binding profile of the expression products under test.

Alternatively, the binding members may be nucleotide primers 
capable of binding to the expression products of genes
identified in Table S6 such that they can be amplified in a
PCR. The primers may further comprise detection means, i.e. 
labels that can be used to identify the amplified sequences 
and their abundance relative to other amplified sequences. 

The breast tumour sample may be obtained as excisional 
breast biopsies or fine-needle aspirates. 

By creating a number of expression profiles of the 
prognostic set from a number of tumour samples, each with an
assigned prognosis, preferably based on a prognostic scale,
it is possible to create a library of profiles for good and
bad prognosis. The greater the number of expression
profiles, the easier it is to create a reliable
characteristic expression profile standard (i.e. including
statistical variation) that can be used as a standard in a
prognostic assay. Thus, a standard profile may be one that
is devised from a plurality of individual expression
profiles and devised within statistical variation to
represent, for example, a 'good' or 'poor' prognosis, or a
high NPI or a low NPI.

In a fourth aspect, there is provided a method of producing
a nucleic acid expression profile for a breast tumour sample 
comprising the steps of 

(a) isolating expression products from said breast
tumour sample;

(b) identifying the expression levels of the prognostic 
set of genes; and 

(c) producing from the expression levels an expression 
profile for said breast tumour sample. 

The expression profile may be added to a gene expression 
profile database. The method may further comprise the step 
of comparing the expression profile with a second expression 
profile (or a plurality of second expression profiles). The
second expression profile (or profiles) may be produced from
a second breast tumour sample (or samples) using
substantially the same prognostic set, wherein a prognosis
has been assigned to, or determined for, the second sample
(or samples). The second expression profile (or profiles)
may be a standard profile (or profiles) characteristic of a
particular prognosis, for example a 'good' prognosis or a
'poor' prognosis, or a high NPI or a low NPI, or at least
one particular NPI value or at least one range of NPI
values.

Preferably the prognosis is in the form of a prognostic 
measure, preferably a clinically accepted prognostic 
classification system, such as the NPI. Again, the
prognosis may be predicted from gene expression data,
derived from clinical techniques, such as histopathological
techniques, or assigned retrospectively to the second
expression profile based on the disease outcome of the
patient(s) that contributed sample(s) from which the second
profile was derived.

With knowledge of the prognostic set, it is possible to 
devise many methods for determining the expression pattern
or profile of the genes in a particular test sample. For 
example, the expressed nucleic acid (RNA, mRNA) can be 
isolated from the sample using standard molecular biological 
techniques. The expressed nucleic acid sequences 
corresponding to the gene members of the genetic identifiers 
given in Table S6 can then be amplified using nucleic acid 
primers specific for the expressed sequences in a PCR. If 
the isolated expressed nucleic acid is mRNA, this can be 
converted into cDNA for the PCR reaction using standard
methods.

The primers may conveniently introduce a label into the 
amplified nucleic acid so that it may be identified. 
Ideally, the label is able to indicate the relative quantity 
or proportion of nucleic acid sequences present after the 
amplification event, reflecting the relative quantity or 
proportion present in the original test sample. For 
example, if the label is fluorescent or radioactive, the
intensity of the signal will indicate the relative
quantity/proportion, or even the absolute quantity, of the
expressed sequences. The relative quantities or proportions
of the expression products of each of the genetic
identifiers will establish a particular expression profile
for the test sample.

The method according to the fourth aspect of the invention 
may comprise the steps of:

(a) isolating expression products from a first breast
tumour sample; contacting said expression products with a
plurality of binding members capable of specifically and
independently binding to expression products of the
prognostic set; and creating a first expression profile from
the expression levels of the prognostic set in the tumour
sample;

(b) isolating expression products from a second breast
tumour sample of known prognosis (as defined previously);
contacting said expression products with a plurality of
binding members capable of specifically and independently
binding to expression products of the prognostic set of step
(a), so as to create a comparable second expression profile
of a breast tumour sample;

(c) comparing the first and second expression profiles
to determine the prognosis of the first breast tumour
sample.

In a fifth aspect of the invention, there is provided an 
expression profile database comprising a plurality of gene 
expression profiles of breast tumour samples, wherein the 
gene expression profiles are derived from the expression 
levels of the prognostic set of genes, which database is 
retrievably held on a data carrier. The database is 
preferably produced by the method according to the fourth 
aspect of the invention. 

The expression profiles are preferably nucleic acid 
expression profiles. The determination of the nucleic acid
expression profile may be computerised and may be carried
out within certain previously set parameters, to avoid false
positives and false negatives.

The database may include expression profiles characteristic
of a particular prognosis, such as good or bad prognosis, or
of a particular prognostic value, preferably NPI value (e.g.
high NPI, low NPI, or specific qualitative value or range of
values). The expression profiles may be categorised
according to the ER status (i.e. ER+ or ER-) of the source
tumour. The database may then be processed and analysed such
that it will eventually contain (i) the numerical data
corresponding to each expression profile in the database,
(ii) a "standard" profile which functions as the canonical
profile for a particular prognostic assignment (e.g. good or
bad prognosis, or value or range of values, preferably from
the NPI); and (iii) data representing the observed
statistical variation of the individual profiles to the
"standard" profile.

The computer may then be able to provide an expression
profile standard characteristic of a breast tumour sample
with a particular prognosis, e.g. good prognosis and/or bad
prognosis and/or a high NPI and/or a low NPI. As stated
earlier, the determined expression profiles may then be used 
to assign a prognosis to the breast tissue sample, 
preferably using a discriminating algorithm, most preferably 
a Weighted Voting algorithm, described above. 
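
The three database components listed above, namely (i) the individual profiles, (ii) a "standard" profile and (iii) the statistical variation about that standard, could be assembled as in the following Python sketch. Taking the per-gene mean as the standard and the per-gene sample standard deviation as the measure of variation is an assumption made for illustration only; the names `build_standard` and `database`, and the example values, are hypothetical.

```python
from statistics import mean, stdev


def build_standard(profiles):
    """Summarise profiles that share one prognostic assignment.

    `profiles` is a list of profiles, each a list with one expression
    level per gene of the prognostic set. Returns (standard, variation):
    the per-gene mean and the per-gene sample standard deviation.
    """
    if len(profiles) < 2:
        raise ValueError("at least two profiles are needed")
    per_gene = list(zip(*profiles))
    standard = [mean(levels) for levels in per_gene]
    variation = [stdev(levels) for levels in per_gene]
    return standard, variation


# Hypothetical profiles for two genes, grouped by assigned prognosis.
individual = {
    "good": [[1.0, 4.0], [3.0, 8.0]],
    "bad": [[7.0, 1.0], [9.0, 3.0]],
}

database = {}
for assignment, profiles in individual.items():
    standard, variation = build_standard(profiles)
    database[assignment] = {
        "profiles": profiles,    # (i) numerical data
        "standard": standard,    # (ii) canonical profile
        "variation": variation,  # (iii) observed variation
    }
```

Adding a newly classified profile would then amount to appending it to the relevant list and recomputing the standard and its variation.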

The classification of the expression profile is more
reliable the greater the number of gene expression levels
tested. The known microarray and genechip technologies allow
large numbers of binding members to be utilized. Therefore,
the more preferred method would be to use binding members
representing all of the genes in Table S6. However, the
skilled person will appreciate that a proportion of these
genes may be omitted and the method still carried out in a 
reliable and statistically accurate fashion. 

The prognostic set in any aspect of the invention may 
comprise, or consist of, all, or substantially all, of the 
genes from Table S6, or all, or substantially all, of the
Positive genes and/or all of the Negative genes. The 
prognostic set of genes may vary in content and number, 
independently, between aspects of the invention. 

The prognostic set may include at least 5, 10, 20, 30, 40, 
50, 60 or all of the genes of Table S6.

Preferably, the said prognostic set comprises, or consists 
of, about sixty or about fifty or about forty or about 
thirty or about twenty or about ten or about five Positive 
genes from Table S6. Positive genes from Table S6 are
preferably selected from the upper portion, preferably the
upper half, of the list of Positive genes in Table S6, as
the genes are ranked in order of significance. 

The prognostic set may comprise one or both of, or may 
consist of both of, the Negative genes from Table S6.

The number and choice of genes are selected so as to provide 
a prognostic set that is at least capable of distinguishing 
between tumours with good prognosis and tumours with bad 
prognosis (or tumours with high NPI and tumours with low
NPI).

The prognostic set may include no more than sixty genes of
Table S6. The prognostic set may comprise no more than fifty
genes of Table S6. The prognostic set may include no more
than forty genes of Table S6. The prognostic set may include
no more than thirty genes of Table S6. The prognostic set
may include no more than twenty genes of Table S6. The
prognostic set may include no more than ten genes of Table
S6. The prognostic set may include no more than five genes
of Table S6.

The prognostic set may comprise, or consist essentially of,
five to sixty genes of Table S6. The prognostic set may
comprise, or consist essentially of, ten to forty genes of
Table S6. The prognostic set may comprise, or consist
essentially of, ten to thirty genes of Table S6. The
prognostic set may comprise, or consist essentially of, ten
to twenty genes of Table S6, or twenty to thirty genes of
Table S6, or, preferably, thirty to forty genes of Table S6.

The prognostic set, preferably about ten or about twenty or 
about thirty genes, may be selected from the first about 
forty, or about thirty, or about twenty genes of Table S6.
About ten genes may be selected from the first about fifteen
genes of Table S6. The about ten genes may be the first ten
genes of Table S6.

The prognostic set may comprise, or consist essentially of, 
about forty or about thirty or about twenty or about ten
genes selected from the group consisting of the first about
forty or about thirty or about twenty or about ten genes of
the Positive genes of Table S6 and, optionally, one or both
Negative genes of Table S6. The prognostic set may comprise,
or consist of, about thirty genes selected from the group
consisting of the first about thirty or about forty Positive
genes of Table S6 and, optionally, one or both Negative
genes of Table S6.

The number of genes in the prognostic set that are in common 
with the U133A microarray is preferably limited as described
above.

The term 'about' preferably means the number of genes stated
plus or minus the greater of: 10% of the number of genes 
stated or one gene. 
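
The tolerance defined above can be worked through with a short Python sketch. The function name `about_range` is hypothetical. The definition does not say how a fractional tolerance (e.g. 1.5 genes for "about fifteen") is to be rounded, so the sketch leaves such values unrounded.

```python
def about_range(n_genes):
    """Return the (low, high) gene counts covered by 'about n_genes'.

    The tolerance is the greater of 10% of the stated number of genes
    or one gene, as set out in the definition above.
    """
    tolerance = max(n_genes / 10, 1)
    return n_genes - tolerance, n_genes + tolerance
```

Under this reading, "about thirty" genes covers 27 to 33 genes, whereas "about five" genes covers 4 to 6 genes because the one-gene minimum exceeds 10% of five.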

The provision of the prognostic set allows diagnostic tools,
e.g. nucleic acid microarrays, to be custom made and used to
predict, diagnose or subtype tumours. Further, such
diagnostic tools may be used in conjunction with a computer
which is programmed to determine the expression profile
obtained using the diagnostic tool (e.g. microarray) and
compare it, as discussed above, to a "standard" expression
profile or a database of expression profiles of 'known'
prognosis. In doing so, the computer not only provides the
user with information which may be used to diagnose the
presence or type of a tumour in a patient, but at the same
time, the computer obtains a further expression profile by
which to determine the 'standard' expression profile and so
can update its own database. 

Thus, the invention allows, for the first time, specialized
chips (microarrays) to be made containing probes
corresponding to the prognostic set. The exact physical
structure of the array may vary and range from
oligonucleotide probes attached to a 2-dimensional solid
substrate to free-floating probes which have been
individually "tagged" with a unique label, e.g. "bar code".

Querying a database of expression profiles with known 
prognosis can be done in a direct or indirect manner. The 
"direct" manner is where the patient's expression profile is 
directly compared to other individual expression profiles in 
the database to determine which profile (and hence which 
prognosis) delivers the best match. Alternatively, the 
querying may be done more "indirectly", for example, the 
patient expression profile could be compared against simply 
the "standard" profile in the database for a particular 
prognostic assignment e.g. 'bad', or a prognostic value or 
range of values, preferably from the NPI, e.g. high NPI. The
advantage of the indirect approach is that the "standard" 
profiles, because they represent the aggregate of many 
individual profiles, will be much less data intensive and 
may be stored on a relatively inexpensive data carrier or 
other memory device (e.g. computer system) which may then 
form part of the kit (i.e. in association with the 
microarrays) in accordance with the present invention.

In the direct approach, it is likely that the data carrier 
will be of a much larger scale (e.g. a computer server), as
many individual profiles will have to be stored. 
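
The difference between the direct and indirect approaches can be sketched in Python as follows. Euclidean distance is used here as the measure of how well two profiles match; this choice, the function names and the example values are assumptions made for illustration and are not specified in the present text.

```python
import math


def distance(profile_a, profile_b):
    """Euclidean distance between two profiles of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(profile_a, profile_b)))


def query_direct(patient, individual_profiles):
    """Direct approach: return the prognosis of the closest stored profile.

    `individual_profiles` is a list of (prognosis, profile) pairs, so
    every individual profile has to be held in the database.
    """
    prognosis, _ = min(
        individual_profiles, key=lambda item: distance(patient, item[1])
    )
    return prognosis


def query_indirect(patient, standards):
    """Indirect approach: return the prognosis of the closest standard.

    `standards` maps each prognostic assignment to a single "standard"
    profile, so far less data has to be stored.
    """
    return min(standards, key=lambda label: distance(patient, standards[label]))


# Hypothetical two-gene data.
individual_profiles = [
    ("good", [1.0, 9.0]),
    ("good", [2.0, 8.0]),
    ("bad", [8.0, 2.0]),
    ("bad", [9.0, 1.0]),
]
standards = {"good": [1.5, 8.5], "bad": [8.5, 1.5]}
patient = [2.5, 7.0]
```

In this example both `query_direct(patient, individual_profiles)` and `query_indirect(patient, standards)` return "good", but the indirect query inspects only two stored profiles rather than four.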

By comparing the patient expression profile to the standard
profile (indirect approach) and the pre-determined 
statistical variation in the population, it will also be 
possible to deliver a "confidence value" as to how closely 
the patient expression profile matches the "standard"
canonical profile, as discussed above. This value will
provide the clinician with valuable information on the
trustworthiness of the prognosis, and, for example, whether
or not the analysis should be repeated.

As mentioned above, it is also possible to store the patient 
expression profiles on the database, and these may be used 
at any time to update the database.

In a sixth aspect, the present invention provides a method 
for identifying a set of genes that are differentially 
expressed within a group of tumours, the method including 
providing an expression profile from each of a plurality of 
tumours of the group, classifying the profiles according to 
molecular subtype of tumour, and analysing expression 
profiles within a subtype to identify the set of genes, 
wherein the genes are differentially expressed within that 
subtype.

This method differs from the method of van't Veer et al. (10)
in that the initial selection of sporadic, lymph node
negative breast tumours in van't Veer et al. involved
subtyping by clinical assessment, rather than subtyping at
the molecular level.

Of course, this aspect and the following aspects of the 
invention are closely related to the preceding aspects.
Preferred features disclosed for the preceding aspects may 
therefore be applied also to this aspect and the following 
aspects, unless the context clearly requires otherwise.

WO 2005/033699 PCT/GB2004/004195

In the context of the sixth, seventh and eighth aspects of 
the invention, the term "expression profile" is not limited
to the genes of the prognostic set. Rather, it refers
generally to the expression levels of genes in the tumours of 
the group, including (but not necessarily only) the 
expression levels of genes that are differentially expressed 
within a molecular subtype. 

Differential expression of the set of genes derived by the 
sixth aspect of the invention (hereinafter 'the 
discriminating set') may be indicative or characteristic of
a particular phenotype or genotype for tumours of the group.
The method preferably includes the step of correlating the
differential expression of the discriminating set to a 
particular phenotype and/or genotype. The expression profile 
of the discriminating set in a number of samples of differing 
but known phenotype and/or genotype may be determined to 
establish a correlation between a particular gene expression 
profile of the discriminating set and a particular phenotype 
and/or genotype.

The differential expression may be characteristic of a 
clinical parameter or medical class assigned to the tumour 
as part of therapy or diagnosis of the patient with the 
tumour e.g. a measure of prognosis, such as an NPI value or 
NPI class. The differential expression of the discriminating 
set may allow a tumour sample to be assigned to one of at 
least two different genotypic or phenotypic classes.

The method of the sixth aspect of the invention may further 
include steps to assign a class to a tumour sample from a
patient, wherein differential expression of genes of the
discriminating set is characteristic of the class, the
steps including providing expression levels in the sample of 
the discriminating set, and assigning a class to the tumour 
based on the expression levels. 

The step of assigning the class may comprise the use of a 
statistical technique such as, but not limited to, Weighted 
Voting, Support Vector Machines or Hierarchical Clustering, 
as discussed previously. Preferably, the method includes the 
step of identifying the molecular subtype of the tumour
sample, and using the discriminating set specific to the
subtype.

Additionally or alternatively, the method of the sixth 
aspect of the invention may include the steps of determining 
the expression levels of the discriminating set in a tumour 
sample, determining an expression profile from the 
expression levels and adding the profile to a database. 
Preferably, the molecular subtype of the tumour sample is 
also identified, and preferably added to the database. 

Standard profiles characteristic of a particular class may
be derived from at least two expression profiles of known 
class, wherein the expression profiles are derived from 
genes of the discriminating set. The standard profile is 
preferably specific to class and molecular subtype. 
Additionally or alternatively, expression profiles of known 
class (and, optionally, subtype) are added to the database. 

Additionally, or alternatively, the method of the sixth
aspect may further include steps to check for a change in
class of the tumour during treatment. In one embodiment,
expression profiles are provided from the tumour at
different stages of treatment (e.g. start of treatment and
end of treatment) and compared to determine a change in
class, wherein the expression profiles are derived from the
expression levels of genes of the discriminating set. The
expression profiles are preferably compared to standard 
and/or known profiles to determine the class. 

The classification according to molecular subtype is 
preferably performed using techniques, such as 
histopathological (e.g. immunological) techniques or gene
expression techniques, that directly measure levels of gene 
expression products in tumour samples. Gene expression 
techniques are most preferred. However, clinical techniques 
that are capable of accurately discriminating between 
molecular subtypes may also be used. 

The tumours are preferably breast tumours and the molecular 
subtype preferably corresponds to the ER (Estrogen Receptor) 
status of the tumour (e.g. ER+). However, the method may be
applied to other groups of tumours (e.g. lung tumours,
ovarian tumours and lymphomas) and/or other molecular
subtypes (e.g. germinal centre-like and activated B-cell-like
in diffuse large B-cell lymphomas). Preferably the
analysis performed on the class of expression profiles to
determine the differentially expressed genes includes
Significance Analysis of Microarrays (SAM, ref. 12), which
identifies genes whose expression levels vary significantly
between samples under comparison. Preferably, the analysis
involves statistical analysis, for example using Weighted
Voting, Support Vector Machines and/or Hierarchical
Clustering (see later for an explanation of these
techniques).

In a seventh aspect of the invention, there is provided the 
set of genes derived by the sixth aspect of the invention. 

In an eighth aspect of the invention, there is provided the 
use of the discriminating set in assigning a tumour sample 
to a particular class. 

Aspects and embodiments of the present invention will now be 
illustrated, by way of example, with reference to the 
accompanying figures. Further aspects and embodiments will 
be apparent to those skilled in the art. All documents 
mentioned in this text are incorporated herein by reference. 

Figure 1 shows clustering of sporadic breast tumors by 
global expression profiles: 

a) Unsupervised hierarchical clustering of 98 breast tumors 
using the top 376 genes exhibiting the highest variation in 
gene expression. 

b) Principal component analysis (PCA) using the 376 gene 
set. Similar molecular groupings are observed as in a). 

c) Hierarchical clustering of samples using the SAM-409 gene 
set, which consists of genes that are significantly 
regulated between tumor subtypes. Approximately two-thirds 
of the genes in the SAM-409 gene set exhibit increased 
expression in ER+ tumors. 

Figure 2 shows identification of an Expression Signature 
Correlated to the NPI (NPI-ES): 

a) Determination of differentially expressed genes using a 
moving NPI threshold. Genes (y-axis) exhibiting significant 
differential expression were identified at each threshold 
value (x-axis). Using a threshold of 4 delivers the highest 
number of differentially regulated genes. 

b) Hierarchical clustering of ER+ samples using the NPI-ES. 
The red bar indicates samples of low NPI (< 4), while the 
blue bar indicates samples of high NPI. 

c) Classification and prediction confidence of ER+ tumor 
samples using the NPI-ES. Samples are sorted by their NPI 
value (x-axis). Weighted voting was used to classify the 
samples, and the prediction strength of each sample (y-axis) 
was calculated as in Golub et al. (13). Sample 
classifications with a prediction strength of <0.3 are 
considered 'uncertain' or 'low-confidence' (grey area). 

Figure 3 shows KM Survival Analysis Comparing the Prognostic 
Strengths of Different Classification Schemes on ER+ Tumors. 
Green lines represent (a) low NPI, (b) low NPI-ES expression 
levels, or (c) low 'prognosis' signature (PES) expression 
levels, while pink lines represent high levels. 

(a) 49 Rosetta ER+ Tumors stratified by classical NPI into 
'good' prognosis (NPI<3.4) (35 tumors) and 'moderate' 
prognosis (NPI>3.4) (14 tumors) groups. 

(b) The same 49 Rosetta ER+ Tumors stratified by NPI-ES into 
groups expressing high (24 tumors) vs low (25 tumors) levels 
of the NPI-ES. 

(c) The same 49 Rosetta ER+ Tumors stratified by the 70-gene 
'prognosis' signature into 'good prognosis' (27 tumors) vs 
'poor prognosis' (22 tumors) groups. 

(d) The 46 Stanford ER+ Tumors stratified by NPI-ES into 
groups expressing high (13 tumors) vs low (33 tumors) levels 
of the NPI-ES. 

Figure S3 shows classification and prediction confidence of 
tumor samples using the 44-gene set based on all tumors 
regardless of subtype. 

Figure S8 shows hierarchical clustering of gene expression 
data from the Rosetta data set. Top) Dendrogram displaying the 
similarities between tumors. The color-coded bar indicates 
the subtype corresponding to each gene signature. Left) The 
full cluster of 276 genes with three distinct gene clusters. 
Note that some ERBB2 tumors appeared to segregate with ER+ 
tumors (red bar), but were identified as ERBB2+ upon close 
inspection of the expression of ERBB2+-related genes (zoomed 
region of the clustergram). This is due to the Rosetta 
microarray possessing a much higher number of genes related 
to the ER+ subtype than to the ERBB2 subtype. 

Figure S9 shows hierarchical clustering of Rosetta ER+ 
samples (49) based upon the expression level of the NPI-ES 
(46 matches found in Rosetta data out of 62 genes). The 
color bar is as defined in Figure 2b. 

Figure S10 shows hierarchical clustering of Stanford breast 
tumors. Top) Dendrogram displaying the similarities between 
tumors. The color-coded bar indicates the subtype 
corresponding to each gene signature. Left) The full cluster 
of 136 genes with three distinct gene clusters. 

Figure S11 shows hierarchical clustering of the 46 Stanford 
ER+ samples using the NPI-ES (31 matches out of 62 genes). The 
color bar is as defined in Figure 2b. 

Figure S12 shows the relationship between NPI-ES Expression 
and NPI Status in the ER- and ERBB2+ Molecular Subtypes. The 
NPI status of ER- and ERBB2+ tumors is in general higher than 
that of ER+ tumors. Unlike the case for ER+ tumors, we were 
unable to identify by SAM genes that were differentially 
regulated in high vs low NPI tumors for the ER- and ERBB2+ 
subtypes. Also, NPI-ES does not appear to be correlated as 
well to NPI values associated with the other molecular 
subtypes. 

Figure S13 shows 20 pairs of samples, obtained 'Before' and 
'After' 14 weeks of doxorubicin treatment (Perou et al., 2000). 
Of the 20 'Before' samples, 10 samples exhibited high levels 
of NPI-ES expression (H), and 10 exhibited low levels of 
expression (L). Of the former 10 samples, 6 retained high 
levels of expression after chemotherapy (H -> H, depicted in 
red), while 4 exhibited low levels of expression after 
treatment (H -> L, depicted in yellow). 

Figure S14 shows a Kaplan-Meier Relapse-free survival 
analysis curve using the patients that contributed the 20 
samples of Figure S13. 

Materials and Methods 

Breast Tissues and Clinical Information 

Human breast tissues were obtained from the NCC Tissue 
Repository, after appropriate approvals from the NCC 
Repository and Ethics Committees. Histological confirmation 
of tumour status and Estrogen Receptor (ER) and ERBB2 
immunohistochemical status were provided by the Dept of 
Pathology at Singapore General Hospital (see Supplementary 
Information for clinical information). Samples contained at 
least 50% tumour content. NPI status was calculated as 
follows: tumour size (cm) x 0.2 + grade + lymph node points 
(negative nodes = 1 point; 1 to 3 positive nodes = 2 
points; 4 or more positive nodes = 3 points). As tumour size 
in the Stanford data set was defined using the CAT system, 
we assigned an approximate value for each CAT grade (i.e. 
T1 = 2 cm, T2 = 3.5, T3 = 5, T4 = 3.5). 
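
The NPI arithmetic described above can be sketched in a few 
lines of Python. This is an illustrative sketch only; the 
function names lymph_node_points and npi are hypothetical and 
are not part of the original methods. 

```python
def lymph_node_points(positive_nodes: int) -> int:
    """Map the number of positive lymph nodes to NPI points (1, 2 or 3)."""
    if positive_nodes < 0:
        raise ValueError("positive_nodes must be non-negative")
    if positive_nodes == 0:
        return 1          # node-negative
    if positive_nodes <= 3:
        return 2          # 1 to 3 positive nodes
    return 3              # 4 or more positive nodes


def npi(size_cm: float, grade: int, positive_nodes: int) -> float:
    """NPI = tumour size (cm) x 0.2 + histological grade + lymph node points."""
    return 0.2 * size_cm + grade + lymph_node_points(positive_nodes)


# A 2 cm, grade 2, node-negative tumour: 0.4 + 2 + 1
print(round(npi(2.0, 2, 0), 2))  # 3.4
```

With these definitions the smallest possible score for a 
grade 1, node-negative tumour approaches 2, which is 
consistent with the NPI range quoted in this document. 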

Sample Preparation and Microarray Hybridization 

RNA was extracted from tissues using Trizol reagent and 
processed for Affymetrix Genechip hybridizations using U133A 
Genechips according to the manufacturer's instructions. 

Data Processing and Analysis 

Raw Genechip scans were quality controlled using Genedata 
Refiner and filtered by removing genes whose expression was 
absent in all samples (i.e. 'A' calls). Expression values were 
subjected to a log2 transformation, and normalized by median 
centering all remaining genes by each sample. Data analysis 
was performed using Genedata Expressionist or conventional 
spreadsheet applications. The unsupervised dataset (Figure 
1, a-b) contains genes exhibiting a standard deviation (SD) 
of >1.5 across all well-measured samples. Minor variations 
of the variation filter used for gene selection also yielded 
very similar results (P. Tan, unpublished data). Duplicate 
probes for the same gene were removed from analysis, leaving 
one probe per gene. Average-linkage hierarchical clustering 
was performed using CLUSTER and displayed using TREEVIEW. 
Significance Analysis of Microarrays (SAM) (12) was 
implemented to identify differentially regulated genes. 
'False discovery rates' were 0.1% for Figure 1c and 15% for 
Figure 2. Weighted Voting (WV), Leave-one-out cross 
validation (LOOCV) assays, and prediction strengths (PS) 
were calculated as in Golub et al. (13) (Supplementary 
Information). Kaplan-Meier survival curves were created 
using SPSS, and log-rank tests were used to calculate the 
statistical significance of differences between survival 
curves. Statistical associations between gene expression and 
clinical variables were determined by chi-square analysis. 
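
The filtering and normalization steps described above can be 
sketched as follows. This is a minimal illustration using only 
the Python standard library, not the Genedata workflow that 
was actually used; the function name preprocess, the dictionary 
layout and the 'P'/'A' call encoding are assumptions made for 
the example, and raw intensities are assumed to be positive. 

```python
import math
import statistics


def preprocess(expr, calls, sd_cutoff=1.5):
    """expr: gene -> list of raw intensities (one per sample).
    calls: gene -> list of detection calls ('A' = absent).
    Returns (centred log2 values per gene, genes passing the SD filter)."""
    # 1. Drop genes called absent in every sample.
    kept = {g: v for g, v in expr.items() if any(c != "A" for c in calls[g])}
    # 2. log2-transform the remaining values.
    logged = {g: [math.log2(x) for x in v] for g, v in kept.items()}
    # 3. Median-centre each sample (column) across the remaining genes.
    n_samples = len(next(iter(logged.values())))
    medians = [statistics.median(v[j] for v in logged.values())
               for j in range(n_samples)]
    centred = {g: [x - medians[j] for j, x in enumerate(v)]
               for g, v in logged.items()}
    # 4. Variation filter: keep genes whose SD across samples exceeds the cutoff.
    variable = [g for g, v in centred.items() if statistics.stdev(v) > sd_cutoff]
    return centred, variable
```

The sample standard deviation (n - 1 denominator) is used here; 
the source does not state which form was applied. 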

Descriptions of Weighted Voting (WV) and Leave-One-Out Cross 
Validation (LOOCV) Assays 

Weighted Voting (WV): The weighted voting algorithm utilizes 
a signal-to-noise (S2N) metric to perform binary 
classifications. Each gene belonging to a predictor set is 
assigned a 'vote', expressed as the weighted difference 
between the gene expression level in the sample to be 
classified and the average class mean expression level. 
Weighting is determined using the correlation metric: 

$$P(g,c) = \frac{\mu_1 - \mu_2}{\sigma_1 + \sigma_2}$$

($\mu$ and $\sigma$ denote the means and standard deviations 
of expression levels of the gene in each of the two 
classes). The ultimate vote for a particular class 
assignment is computed by summing all weighted votes made by 
each gene used in the class discrimination. The "prediction 
strength" (PS) is defined as: 

$$PS = \frac{V_{WIN} - V_{LOSE}}{V_{WIN} + V_{LOSE}}$$

where $V_{WIN}$ and $V_{LOSE}$ are the vote totals for the 
winning and losing classes, respectively. PS reflects the 
relative margin of victory and hence provides a quantitative 
reflection of prediction certainty. 
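
A compact sketch of weighted voting and the prediction 
strength is given below. It follows the general scheme of 
Golub et al. (13) as summarized above, with the decision 
boundary for each gene taken as the midpoint of the two class 
means. The function names, the dictionary-based data layout 
and the use of the sample standard deviation are assumptions 
made for illustration; each class needs at least two samples 
and a non-zero spread for every gene. 

```python
import statistics


def train_weighted_voting(class1, class2):
    """class1, class2: lists of samples, each a dict gene -> expression.
    Returns gene -> (signal-to-noise weight, decision boundary)."""
    model = {}
    for gene in class1[0]:
        x1 = [s[gene] for s in class1]
        x2 = [s[gene] for s in class2]
        m1, m2 = statistics.mean(x1), statistics.mean(x2)
        s1, s2 = statistics.stdev(x1), statistics.stdev(x2)
        weight = (m1 - m2) / (s1 + s2)   # S2N metric
        boundary = (m1 + m2) / 2         # average of the class means
        model[gene] = (weight, boundary)
    return model


def predict(model, sample):
    """Return (predicted class 1 or 2, prediction strength in [0, 1])."""
    v1 = v2 = 0.0
    for gene, (weight, boundary) in model.items():
        vote = weight * (sample[gene] - boundary)
        if vote > 0:
            v1 += vote       # positive votes favour class 1
        else:
            v2 += -vote      # negative votes favour class 2
    win, lose = max(v1, v2), min(v1, v2)
    ps = (win - lose) / (win + lose) if win + lose > 0 else 0.0
    return (1 if v1 >= v2 else 2), ps
```

A call whose prediction strength falls below a chosen cut-off 
(0.3 in this document) would be reported as 'uncertain' rather 
than as a firm class assignment. 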

Leave-One-Out Cross Validation (LOOCV): We used a standard 
leave-one-out cross-validation (LOOCV) approach to assess 
classification accuracy in the training set. In LOOCV, one 
sample in the training set is initially 'left out', and the 
classifier operations (e.g. gene selection and classifier 
training) are performed on the remaining samples. The 'left 
out' sample is then classified using the trained algorithm, 
and this process is then repeated for all samples in the 
training set. 
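
The LOOCV loop described above can be sketched generically. 
The classifier is passed in as a pair of functions so that all 
training steps are repeated inside each fold; the toy 
nearest-mean classifier is included only to make the example 
runnable and is not the classifier used in this work. All 
names are hypothetical. 

```python
def loocv_accuracy(samples, labels, train, predict):
    """Leave each sample out in turn, train on the rest, classify it,
    and return the fraction of left-out samples classified correctly."""
    correct = 0
    for i in range(len(samples)):
        train_x = samples[:i] + samples[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        model = train(train_x, train_y)      # training never sees sample i
        if predict(model, samples[i]) == labels[i]:
            correct += 1
    return correct / len(samples)


def train_nearest_mean(xs, ys):
    """Toy classifier: remember the mean value of each class."""
    groups = {}
    for x, y in zip(xs, ys):
        groups.setdefault(y, []).append(x)
    return {y: sum(v) / len(v) for y, v in groups.items()}


def predict_nearest_mean(model, x):
    """Assign the class whose mean is closest to x."""
    return min(model, key=lambda y: abs(x - model[y]))
```

Passing gene selection into train, rather than performing it 
once on the full data set, avoids leaking information from the 
left-out sample into the classifier. 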

Results and Discussion 

Defining Molecular Subtypes of Breast Cancer Using 
Unsupervised Clustering 

It has been proposed that a significant proportion of the 
intrinsic gene expression variation in breast cancer can be 
attributed to different tumours belonging to distinct 
'molecular subtypes' (e.g. ER+ and ER- tumours) (8-9, 14). In 
an initial analysis where tumours were treated irrespective 
of subtype, we could not convincingly identify an expression 
signature correlated to the NPI. We hypothesized that this 
might be due to dramatic differences in gene expression 
between subtypes (inter-subtype differences) potentially 
obscuring more subtle patterns of variation within subtypes 
(intra-subtype differences). To circumvent this problem, we 
implemented a methodology where each molecular subtype was 
treated as an independent data set. Briefly, a variety of 
unsupervised clustering techniques were first used to 
broadly segregate a set of breast tumour expression profiles 
according to their respective 'molecular subtype' 
categories. Second, tumours within each subtype were then 
independently analyzed to define expression signatures that 
might be correlated to the NPI or its constituent elements. 

Using Affymetrix U133A Genechips, we generated expression 
profiles for 98 sporadic breast tumours derived from our 
local predominantly Chinese patient population. After data 
normalization and pre-processing, we applied a standard 
deviation filter to identify a 367 gene set exhibiting a 
high degree of gene expression variation across the tumour 
series, and used this gene set to group the tumour 
expression profiles on the basis of their overall similarity 
using unsupervised hierarchical clustering. The breast 
tumours self-segregated into three major subgroups, referred 
to as ER+, ER-, and ERBB2+ respectively (Figure 1a). This 
segregation pattern was confirmed using principal components 
analysis (PCA), an independent analytical technique (Figure 
1b), which delivered highly similar results. To robustly 
identify these groupings, we used SAM (12) to identify genes 
that were differentially expressed between the subtypes. At 
an FDR ('False Discovery Rate') of 0.1%, we identified 409 
genes that were significantly regulated in a 
subtype-specific manner (Figure 1c). 

The list of Table S5 represents the top 50 genes identified 
by SAM to be significantly regulated in each molecular 
subtype (ER+, ER-, ERBB2+). The genes are ranked by their 
S2N correlation ratio, which reflects the extent of the 
expression perturbation observed among different groups. 
There is good overlap between these genes and similar lists 
reported by other studies (refs. 8-11). 

Approximately 69% of the 409 gene set exhibited increased 
expression in the ER+ subgroup, including the estrogen 
receptor gene ESR1 and estrogen-regulated genes such as 
LIV1, TFF1, and MYB (Supplementary Information). In 
agreement with other studies, high expression levels of 
GATA3, HNF3a, Annexin A9, and XBP1 were also observed in 
this subtype (8-9, 11). The ER- subgroup was associated with 
high expression of basal mammary epithelia markers (keratin 
5 and 17), the basement membrane protein ladinin 1, the 
serine protease KLK5, which has been associated with poor 
disease prognosis (15), and the serine protease inhibitor 
maspin, a tamoxifen-inducible gene that has been previously 
reported to be expressed in an inverse fashion to ER (16). 
Finally, the ERBB2+ subtype was associated with high 
expression levels of the ERBB2 receptor and other genes 
physically linked to the 17q locus, such as GRB7 and PNMT 
(14), suggesting the presence of DNA amplification. However, 
the majority of genes exhibiting increased expression 
specifically in the ERBB2+ subtype are not confined to the 
17q locus but are found throughout the genome, such as 
members of the S100 calcium-binding family (S100A8, A9). 
Taken collectively, our results validate and confirm 
previous reports that the majority of breast tumours can 
indeed be subdivided into distinct molecular subtypes on the 
basis of their global gene expression profiles. 

Identification of a Prognostic Set Correlated to the NPI in 
ER+ Tumours 

We focused on 34 tumours belonging to the ER+ molecular 
subtype and attempted to identify genes within this subtype 
whose expression might be correlated to NPI status. 
Classically, breast cancer patients are stratified by the 
NPI into 3 major groups: 'good' prognosis (NPI <3.4), 
'moderate' prognosis (NPI 3.4 - 5.4), and 'poor' prognosis 
(NPI > 5.4) (2). Possibly reflecting the effects 
of variability across different scoring pathologists, other 
studies have proposed slightly different cut-off values 
defining these groups (17). To avoid any 
potential bias in determining the appropriate NPI cut-off 
value, we conducted a moving threshold analysis where the 
ER+ tumours were divided into a series of binary groups by an 
NPI threshold that was steadily increased from 2.3 to 7.8. At 
each threshold value, genes exhibiting significant variation 
in expression between the two groups were identified. We 
found that using an NPI cut-off value of 3.8 to 4.6 yielded 
a gene set of 62 differentially expressed genes (Figure 2a), 
the majority of which exhibited increased expression in the 
ER+ samples with a high NPI (Figure 2b). We refer to this 
62-member gene set as an 'NPI Expression Signature' or 
NPI-ES, shown in Table S6. The genes belonging to the NPI 
expression signature are associated with a wide variety of 
cellular functions implicated in oncogenesis, including DNA 
replication and cell division (APRT, MCM4, KNSL1, CDC2), 
cellular signaling (chemokine ligand 1, Met, Shc), apoptosis 
(survivin, CD27 binding protein), and cellular adhesion 
(discs-large homolog 7, tetraspan 1). Of the individual NPI 
components (tumour size, tumour grade, lymph node status), 
tumour grade appears to represent the predominant 
contributor to the molecular makeup of the NPI-ES 
(Supplementary Information). 
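
The moving threshold analysis can be sketched as a simple 
scan over candidate NPI cut-offs. In the sketch below a plain 
signal-to-noise cut-off stands in for SAM, which was the 
method actually used; the function names, the s2n_cutoff 
parameter and the requirement of at least two samples per 
group are assumptions made for illustration. 

```python
import statistics


def count_de_genes(expr, npi_values, threshold, s2n_cutoff=1.0):
    """Count genes whose |signal-to-noise| between the low-NPI and
    high-NPI groups exceeds s2n_cutoff (a stand-in for SAM)."""
    low = [j for j, v in enumerate(npi_values) if v < threshold]
    high = [j for j, v in enumerate(npi_values) if v >= threshold]
    if len(low) < 2 or len(high) < 2:
        return 0  # too few samples on one side to estimate a spread
    count = 0
    for values in expr.values():
        a = [values[j] for j in low]
        b = [values[j] for j in high]
        spread = statistics.stdev(a) + statistics.stdev(b)
        diff = abs(statistics.mean(a) - statistics.mean(b))
        if spread > 0 and diff / spread > s2n_cutoff:
            count += 1
    return count


def moving_threshold(expr, npi_values, thresholds):
    """Return threshold -> number of differentially expressed genes."""
    return {t: count_de_genes(expr, npi_values, t) for t in thresholds}
```

The threshold giving the largest count corresponds to the 
peak of a curve such as that of Figure 2a. 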



WO 2005/033699 PCT/GB2004/004195 



Classification of Tumours by the NPI-ES Defines Two Discrete 
Molecular Groups 

One proposed advantage in the use of molecular profiles for 
tumour classification is the ability to mathematically 
quantify the confidence level of the classification (11), 
which is particularly important if the classification 
affects the subsequent course of treatment. In such a 
scenario, the treating physician can then weigh the 
confidence level of a prediction against the potential 
morbidity of a specific intervention. Notably, although the 
ER+ samples in our data set were associated with a 
continuous spectrum of classical NPI values (2 to 8), the 
clustering analysis using the NPI-ES appeared to separate 
the ER+ tumours into two apparently discrete groups (Figure 
2b), raising the possibility that samples exhibiting 
continuous values based upon histopathological parameters 
may nevertheless be separable into discrete categories at 
the molecular level. 

To better define the ability of the NPI-ES to confidently 
discriminate between these two classes, we used Weighted 
Voting (13), a supervised learning algorithm, to distinguish 
between tumours exhibiting high and low expression of the 
NPI-ES, and tested the classification accuracy of the 
trained algorithm using an established leave-one-out cross 
validation (LOOCV) assay. In addition to classification 
accuracy, quantitative metrics (prediction strengths, PS) 
were also calculated as described in Golub et al. (13) to 
provide an assessment of the prediction confidence (Figure 
2c). The WV analysis revealed that the NPI-ES delivered a 
LOOCV classification accuracy of 91%, with 3 
misclassifications. Of the 3 samples that were wrongly 
classified, 2 were associated with a low prediction strength 
(PS < 0.3), and thus represent 'low-confidence' or 
'uncertain' classifications. Indeed, of the 29 (out of 34) 
ER+ tumours associated with a 'high-confidence' 
classification (PS>0.3), only one sample was wrongly 
classified. These results suggest that the NPI-ES can be 
used to classify the majority of the ER+ tumours in our data 
set into discrete groups with high confidence. 
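
The tally of high-confidence and low-confidence calls can be 
sketched as below. The function name and the tuple layout are 
hypothetical, and a prediction strength exactly equal to the 
cut-off is treated here as high-confidence, which the text 
does not specify. 

```python
def summarise_by_confidence(results, ps_cutoff=0.3):
    """results: list of (true_label, predicted_label, prediction_strength).
    Returns sample and error counts per confidence bucket, plus accuracy."""
    summary = {"high": {"n": 0, "errors": 0}, "low": {"n": 0, "errors": 0}}
    for true_label, predicted, ps in results:
        bucket = "high" if ps >= ps_cutoff else "low"
        summary[bucket]["n"] += 1
        if predicted != true_label:
            summary[bucket]["errors"] += 1
    total = len(results)
    errors = summary["high"]["errors"] + summary["low"]["errors"]
    summary["accuracy"] = (total - errors) / total
    return summary
```

Applied to the counts reported above (29 high-confidence calls 
with 1 error and 5 low-confidence calls with 2 errors), the 
overall accuracy is 31/34, or approximately 91%. 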

Derivation of an NPI Expression Signature Using All Tumors, 
Regardless of Subtype 

We defined the NPI-ES using a two-step methodology. 
Initially, unsupervised clustering was used to cluster 
tumors according to their respective 'molecular subtype' (i.e. 
ER+, ER-, ERBB2+). Tumors within each subtype were then analyzed 
for expression signatures that might be correlated to the 
NPI. Here, we show that performing the first step 
(definition of distinct molecular subtypes) is important in 
the identification of the NPI-ES. 

We assembled a data set consisting of all 79 tumors, 
regardless of molecular subtype, and performed a moving NPI 
threshold analysis to define an 'appropriate' NPI threshold, 
as above (see Figure 2a). We found that using an NPI 
threshold of 4 yielded a total of 44 differentially 
expressed genes. Of this 44 gene set, 16 (35%) also belong 
to the NPI-ES (which was derived from ER+ samples). 

We used Weighted Voting (WV) and cross-validation (LOOCV) 
assays to assess the ability of this 44 gene set to 
confidently classify the tumor samples into discrete groups. 
As can be seen in Figure S3, the number of low-confidence 
(PS<0.3, red area) samples, as well as the misclassification 
rate (9% for the 44 gene set), are both significantly 
increased compared to Figure 2c. This result indicates that 
the 44-gene set, based upon all 79 tumors, is less effective 
in predicting the NPI status of a tumor than the NPI-ES on 
ER+ tumors. 

In Fig. S3, samples are sorted by their NPI value (x-axis). 
Weighted voting was used to classify the samples, and the 
prediction strength of each sample (y-axis) was calculated 
as in Golub et al. (13). Sample classifications with a 
prediction strength of <0.3 are considered 'uncertain' or 
'low confidence' (grey area). A higher number of 'uncertain' 
(low PS) samples and misclassified samples are observed 
compared to Figure 2c. 

The 44 gene set derived from all tumors regardless of 
subtype is also not as effective as the NPI-ES at predicting 
NPI status in an independent data set. Using the Rosetta 
data set as a blinded test set, we applied the 44 gene set 
to the 49 ER+ tumors found in the Rosetta data set, and used 
Student's t-test to determine the significance of the 
association between ER+ tumors expressing high levels of 
the 44 gene set and possessing a high NPI. We obtained a 
p-value of 0.29 for the 44 gene set, which was much less 
significant than the p-value of 0.0004 for the NPI-ES. 

Interestingly, the NPI-ES, despite being derived from an 
analysis of ER+ tumors, outperforms the 44 gene set even 
when applied across all 78 tumors in the Rosetta data set. 
To illustrate this, the 78 Rosetta tumors were divided into 
two groups of NPI<3.4 (good prognosis) and NPI>3.4 
(moderate prognosis) respectively. Weighted voting was then 
used to classify the Rosetta tumors by the NPI-ES or the 44 
gene set. As can be seen in Table S3, the NPI-ES delivered a 
classification accuracy of 80%, compared to the 44 
gene set which delivered a 70% classification accuracy. 

Genes associated with histological grade (1 & 2 vs. 3) 

Since the classical NPI is a composite metric derived from 
tumor grade, tumor size, and lymph node status, we defined 
the contributions made by each of these individual elements 
to the molecular makeup of the NPI-ES. Using SAM to identify 
genes correlated to each of the three histopathological 
variables, we were unable to convincingly identify genes 
whose expression was significantly correlated to either 
tumor size or lymph node status. In contrast, in the case of 
histological grade, a significant number of genes were found 
to be differentially expressed between grade 1 or 2 and 
grade 3 tumors, and the genes in this grade-correlated gene 
set exhibited substantial overlap (66%) with the NPI-ES 
(Table S6). These results suggest that tumors exhibiting 
different histological grades may be biologically distinct, 
and that tumor grade is a key contributor to the NPI 
expression signature, with the remaining two parameters 
(tumor size and lymph node status) delivering comparatively 
lesser contributions. 

Application of the NPI-ES Across Multiple Independent Breast 
Cancer Expression Data Sets 

To test the ability of the NPI-ES to predict both NPI status 
and disease prognosis in a series of blind 'test sets', we 
used two independent breast cancer data sets that were 
publicly available. The first data set (referred to as the 
Rosetta data set) consists of 78 lymph-node negative breast 
tumours profiled using oligonucleotide-based microarrays, 
and also contains the duration of 'disease free survival' 
(DFS) (the time from initial tumour diagnosis to the 
appearance of a new distant metastasis) for each patient 
(10). Importantly, several studies have previously shown the 
NPI to be of prognostic value even in node-negative breast 
cancers (18, 19). The second data set consists of 78 breast 
carcinomas profiled using cDNA microarrays with overall 
patient survival information (referred to as the Stanford 
data set) (14). The availability of these data sets allowed 
us to independently test the predictive power of the NPI-ES, 
as the Rosetta and Stanford data sets are different from our 
data set in multiple ways, including i) patient population, 
ii) sample handling protocols, iii) scoring pathologist and 
iv) choice of array technology and probe sets (two-color in 
the Rosetta and Stanford data sets and single color in 
ours). 

Rosetta Breast Cancer Data Set: Of the 409 genes identified 
by SAM analysis defining the ER+, ER-, and ERBB2+ subtypes, 
276 genes (67%) were found on the Rosetta microarray. We 
applied this gene set to the 78 Rosetta tumour profiles and 
identified 49 tumours belonging to the ER+ molecular subtype. 
We then determined that 46 out of 62 genes belonging to the 
NPI-ES were also present on the Rosetta microarray. Since the 
Rosetta data set is based upon a different array technology 
from ours, it is not possible to directly apply the trained 
Weighted Voting model developed on our data set to classify 
the Rosetta tumours. 

However, following the strategy described in Ramaswamy et 
al. (20) for the comparison of gene sets across different 
array technologies, we used hierarchical clustering to group 
the 49 ER+ Rosetta tumours using the overlapping NPI-ES set 
of 46 genes. The clustering analysis divided the 49 ER+ 
Rosetta tumours into 2 groups consisting of 24 and 25 
tumours exhibiting 'high' and 'low' expression levels of the 
NPI-ES respectively (see Figure S9). 

We compared the tumours in these two subgroups to determine 
if they were associated with differences in their NPI 
values. Using two distinct statistical approaches where the 
tumour NPI values were treated either as a continuous 
gradient (Student's t-test), or as two discrete groups 
(chi-square analysis, using the classical NPI cut-off value 
of 3.4), tumours exhibiting high expression of the NPI-ES 
consistently exhibited a significantly higher NPI value 
compared to tumours expressing low levels of the NPI-ES 
(p=0.0004 for continuous analysis, p=0.0087 for binary 
analysis) (Table 1a). This analysis indicates that 
expression of the NPI-ES is significantly correlated with 
classical NPI status in ER+ tumours even in an independent 
data set generated by a different array technology. 

To compare the prognostic power of the NPI-ES to the 
classical NPI system of staging, odds-ratio calculations 
were performed (Table 1b). Patients with ER+ tumours 
expressing high levels of the NPI-ES had an odds-ratio for 
distant metastases within five years of 10.3 (95% CI 2.4 to 
44.0, p<0.001) compared to ER+ tumours expressing low levels 
of the NPI-ES. In comparison, patients with ER+ tumours with 
a classical NPI index of >3.4 ('moderate' prognosis) had a 
lower odds-ratio for distant metastases of 6.1 (95% CI 1.6 to 
23.4, p=0.06) compared to ER+ tumours with an NPI index of 
<3.4 ('good' prognosis). We also compared the prognostic 
performance of the NPI-ES and NPI using Kaplan-Meier 
survival analysis (Figure 3). In agreement with other 
studies, patients with tumours of low NPI (<3.4) exhibited 
better DFS as compared to patients of higher NPI (>3.4) 
(p=0.007, Figure 3a). When this same population was 
restratified by the NPI-ES, patients with tumours exhibiting 
low expression of the NPI-ES exhibited better relapse-free 
survival (p=0.0007) compared to patients with tumours 
expressing high levels of the NPI-ES. Taken collectively, 
these data suggest that for ER+ tumours, the prognostic 
power of the NPI expression signature may outperform the 
classical NPI system of staging. 
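
An odds ratio with an approximate 95% confidence interval can 
be computed from a 2x2 table as sketched below. The log-based 
(Woolf) interval is a common choice and is an assumption here; 
the source does not state which method was used, and the 
counts in the example are hypothetical rather than taken from 
Table 1b. All four cells must be non-zero. 

```python
import math


def odds_ratio(a, b, c, d, z=1.96):
    """2x2 table: a = high signature with event, b = high without event,
    c = low signature with event, d = low without event.
    Returns (odds ratio, lower bound, upper bound) using the Woolf method."""
    ratio = (a * d) / (b * c)
    se_log = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)  # SE of ln(odds ratio)
    lower = math.exp(math.log(ratio) - z * se_log)
    upper = math.exp(math.log(ratio) + z * se_log)
    return ratio, lower, upper


# Hypothetical counts, for illustration only.
ratio, lower, upper = odds_ratio(10, 5, 5, 10)
print(f"OR = {ratio:.1f} (95% CI {lower:.2f} to {upper:.2f})")
```

An interval that excludes 1 indicates an association that is 
significant at roughly the 5% level under this approximation. 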

Stanford Data Set: A similar approach was used to test the 
NPI-ES on the Stanford data set (see Fig. S10). Of the 
SAM-409 gene set used to define the ER+, ER-, and ERBB2+ 
subtypes, 136 genes were found on the Stanford microarray 
(http://genome-www5.stanford.edu/MicroArray/SMD/), and these genes were 
used to cluster the Stanford tumours to identify 46 tumours 
belonging to the ER+ molecular subtype (from 72 tumors after 
discarding the normal-like tumor subgroup of 6 tumors, which 
subgroup is likely to be due to the presence of 
contaminating non-malignant tissue). 

These 46 tumours were then clustered (see Fig. S11) using 
the NPI-ES (31 matches on the Stanford microarray) into 
'high-NPI-ES' (13 tumours) and 'low-NPI-ES' (33 tumours) 
groups. Once again, Student's t-test revealed a 
significant association (p=0.001) between the high and low 
expressing NPI-ES subgroups and classical NPI status (Table 
1a). In addition, a KM survival analysis also demonstrated a 
significant (p=0.0493) overall survival advantage in 
patients with low-NPI-ES expressing tumours compared to 
patients with high-NPI-ES expressing tumours (Figure 3d). 

Interestingly, there appears to be a strong correlation 
between ER+ tumours expressing high levels of the NPI-ES and 
the 'Luminal C' molecular subtype, which Sorlie et al. (14) 
previously identified on the basis of an 'intrinsic' set of 
500 genes. The overlap between 'Luminal C' tumours and 
tumours expressing high levels of the NPI-ES is strong (96%), 
although none of the 62 genes belonging to the NPI-ES are 
found in this 'intrinsic' set. This is illustrated in Table 
S11. 

The Prognostic Capacity of the NPI-ES is Comparable to a 
Previously Described "Prognosis Signature" for Breast Cancer 

In the same study by van 't Veer et al. (10), the authors also 
identified a 70-gene 'prognosis' expression signature (PES) 
that predicted the DFS status of breast tumours. 
Interestingly, there is minimal overlap between the genes 
belonging to the NPI-ES and the PES, as only one gene is 
found in common between the two. To compare the prognostic 
performance of the NPI-ES and the PES on the Rosetta ER+ 
tumours, we used KM survival analysis to compare the DFS of 
patients stratified either by the NPI-ES (Figure 3b) or the 
PES (Figure 3c). A slightly better performance was observed 
with the PES (p=0.0001) compared to the NPI-ES (p=0.0007). 
The marginal improvement associated with the PES, however, 
is not unexpected since the identification of the PES was 
directly based upon the expression profiles and clinical 
information of these same tumours. As such, the Rosetta 
tumours are not 'blinded' to the PES, while in the case of 
the NPI-ES, the Rosetta tumours represent a true independent 
test set. Indeed, when the PES and NPI-ES were applied to 
the Stanford ER+ tumours, both molecular signatures 
delivered highly similar odds-ratios (3.9 for PES vs 4.17 
for NPI-ES) for relapse within 5 years (Table 1c). Thus, 
these results suggest that the prognostic power of the 
NPI-ES is relatively comparable to that of the PES. 

Expression of the NPI-ES Molecular Signature Predicts 
Chemotherapy Response 

In this analysis, we examined the expression of the NPI-ES 
molecular signature in paired breast tumor samples before 
and after chemotherapy, and correlated the expression of 
this signature to eventual clinical response. 

A publicly available breast cancer data set ("Stanford") was 
utilized, consisting of 20 pairs of samples, obtained 
'Before' and 'After' 14 weeks of doxorubicin treatment (8). Of 
the 62 genes found in the NPI-ES, 31 genes were also found 
on the Stanford microarray, and the expression of the 31 
gene set was examined in the paired samples. 

Of the 20 'Before' samples, 10 samples exhibited high levels
of NPI-ES expression (H), and 10 exhibited low levels of
expression (L). As shown in Figure S13, of the former 10
samples, 6 retained high levels of expression after
chemotherapy (H -> H, depicted in red), while 4 exhibited
low levels of expression after treatment (H -> L, depicted
in yellow). The number of deaths (after 5 years) was then
tabulated for each group as shown in Table S12.

A Kaplan-Meier relapse-free survival analysis was then
performed, and is shown in Figure S14. We found that the
'H->L' tumors had the best survival outcome (p=0.022) compared
to the other groups, while 'H->H' tumors had the worst
prognosis. This result suggests that down-regulation of the
NPI-ES in high-expression NPI-ES tumors can be taken as a
marker of chemotherapy response.
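
To make the Kaplan-Meier step concrete, the following is a minimal sketch of the product-limit estimator. The function name and the follow-up values are hypothetical illustrations, not data or code from this study.

```python
def kaplan_meier(times, events):
    """Return [(time, survival_probability)] at each distinct event time.

    times  -- follow-up time for each patient
    events -- 1 if relapse was observed at that time, 0 if censored
    """
    # Sort patients by follow-up time.
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        # Count relapses and all patients leaving the risk set at time t.
        relapses = 0
        leaving = 0
        while i < len(data) and data[i][0] == t:
            relapses += data[i][1]
            leaving += 1
            i += 1
        if relapses > 0:
            survival *= 1.0 - relapses / at_risk
            curve.append((t, survival))
        at_risk -= leaving
    return curve


# Hypothetical relapse-free follow-up (months) for one expression group.
example_curve = kaplan_meier([6, 12, 12, 20, 30, 36], [1, 1, 0, 1, 0, 0])
```

Curves computed this way for each expression group (for example H->H and H->L) would then be compared with a log-rank test to obtain a p-value of the kind quoted above.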

In summary, we have identified a 62-gene expression
signature that can potentially function as a molecular
surrogate for the NPI. Confidence in the reliability
of the NPI-ES was obtained by showing that it could predict
both NPI status and disease prognosis for two independent
sets of tumours generated by different centers. One
interesting concept emerging from this study is that samples
exhibiting apparently continuous variables at the
histopathological level may nevertheless be separable into
discrete categories at the molecular level. This may address
a major challenge in cancer histopathology, namely the
difficulty of defining clinically appropriate cut-off
values when the parameter being scored is of a continuous
nature. We conclude by acknowledging that more work needs to
be performed before the clinical utility of the NPI-ES can
be fully assessed. First, the predictive power of the NPI-ES
obviously needs to be tested against a much larger group of
tumours.

Second, although we have demonstrated the applicability of 
the NPI-ES in the ER+ molecular subtype, expression of the 
NPI-ES does not appear to be correlated as well to NPI 
values associated with the other molecular subtypes (ER-, 
ERBB2+) (Supplementary Information) . 

Sample Data 

Table S14 shows expression data for the prognostic set (or
NPI-ES) of genes across samples of differing NPI value. The
data are specific for the Affymetrix U133A genechip and have
been preprocessed. The gene expression profiles
of the prognostic set can be used as training data to build
a predictive model (e.g. WV and SVM), which can then assign
the NPI class of an unknown tumour.

The data are tab-delimited and have the following format:

Columns:
1st column: Probe_ID of prognostic set genes
2nd column: Gene Name
3rd and other columns: gene expression data

Rows:
1st row: Sample IDs (35 samples)
2nd row: NPI index
3rd and other rows: gene expression data
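
The layout above can be read with a few lines of code. This is a minimal sketch under the assumption that the first two cells of the sample-ID and NPI rows are labels or empty; the function name and the demo values are hypothetical.

```python
import csv
import io


def parse_expression_table(text):
    """Parse the tab-delimited layout described above.

    Returns (sample_ids, npi_values, genes), where genes maps each
    Probe_ID to (gene_name, [expression values]).
    """
    rows = list(csv.reader(io.StringIO(text), delimiter="\t"))
    # Row 1: two label cells (assumed), then the sample IDs.
    sample_ids = rows[0][2:]
    # Row 2: two label cells (assumed), then one NPI value per sample.
    npi_values = [float(v) for v in rows[1][2:]]
    genes = {}
    for row in rows[2:]:
        if not row:
            continue  # skip blank lines
        probe_id, gene_name = row[0], row[1]
        genes[probe_id] = (gene_name, [float(v) for v in row[2:]])
    return sample_ids, npi_values, genes


# Hypothetical two-sample, two-gene excerpt (values are made up).
demo = (
    "Probe_ID\tGene\tS1\tS2\n"
    "NPI\t\t2.3\t5.6\n"
    "probe_a\tgene A\t7.1\t9.4\n"
    "probe_b\tgene B\t3.0\t2.2\n"
)
samples, npi, genes = parse_expression_table(demo)
```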

The gene expression data are derived as described in the
'Sample Preparation and Microarray Hybridization' and 'Data
Preprocessing' sections (see Materials and Methods section). In
particular, raw gene expression data values are calculated
by the instrument used to measure the microarray (usually a
microarray scanner, e.g. Affymetrix).

Table S15 shows the mean (μ) and standard deviation (σ)
parameters for use in a Weighted Voting algorithm for each
gene of the prognostic set in each class. These data could
be used to assign the prognosis of an unknown breast tumour
sample given a set of expression levels for genes of the
prognostic set. The data are specific to Weighted Voting
techniques applied to expression data from the Affymetrix U133A
genechip.
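
As an illustration of how per-class μ and σ values could drive a classification, here is a minimal sketch of weighted voting. It assumes the commonly used signal-to-noise formulation (weight = (μ_high - μ_low) / (σ_high + σ_low), decision boundary at the midpoint of the class means); this section does not restate the formula, and the gene names and parameter values below are hypothetical, not taken from Table S15.

```python
def weighted_vote(sample, params):
    """Classify one sample by weighted voting.

    sample -- {gene: expression level}
    params -- {gene: (mean_high, sd_high, mean_low, sd_low)}
    Returns (predicted_class, prediction_strength).
    """
    votes_high = 0.0
    votes_low = 0.0
    for gene, (mu_h, sd_h, mu_l, sd_l) in params.items():
        if gene not in sample:
            continue  # gene missing from this array
        weight = (mu_h - mu_l) / (sd_h + sd_l)  # signal-to-noise weight (assumed)
        boundary = (mu_h + mu_l) / 2.0          # midpoint between class means
        vote = weight * (sample[gene] - boundary)
        if vote > 0:
            votes_high += vote
        else:
            votes_low += -vote
    total = votes_high + votes_low
    winner = "high" if votes_high >= votes_low else "low"
    strength = abs(votes_high - votes_low) / total if total else 0.0
    return winner, strength


# Hypothetical parameters for two genes.
params = {
    "gene_up": (8.0, 1.0, 5.0, 1.0),    # higher in the high-NPI class
    "gene_down": (4.0, 0.5, 6.0, 0.5),  # higher in the low-NPI class
}
label, strength = weighted_vote({"gene_up": 8.5, "gene_down": 4.2}, params)
```

The prediction strength gives a rough confidence for the call: it approaches 1 when nearly all votes agree and 0 when the two classes are evenly supported.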
Table 1a) Association of NPI-ES Expression and NPI status in Rosetta and Stanford ER+
tumors. The 1st column represents the number of tumors expressing high or low levels of
the NPI-ES.





                 Student's t-test (continuous)    Chi-square (binary)
Rosetta          mean (variance)    p=0.0004      Low (<3.4)    High    p=0.0087
  High (24*)     3.1±0.4                          13            11
  Low (25)       2.3±0.6                          22            3
Stanford                            p=0.001
  High (13)      5.3±0.5
  Low (33)       4.5±0.6

*Figure in parenthesis represents the no. of samples.
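
The binary comparison for the Rosetta tumors can be checked from the four counts in Table 1a. The sketch below assumes a Pearson chi-square test without continuity correction (the table does not say which variant was used); under that assumption the counts give a p-value close to the tabulated 0.0087. The function name is hypothetical.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for [[a, b], [c, d]]."""
    n = a + b + c + d
    stat = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # With 1 degree of freedom the upper-tail probability is erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


# Rosetta counts: rows = NPI-ES High / Low, columns = NPI Low (<3.4) / High.
stat, p = chi_square_2x2(13, 11, 22, 3)
```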



Table 1b) Odds ratio for distant metastasis within five years as a first event in Rosetta
ER+ Tumors based upon classical NPI staging and NPI-ES expression

                      ER+ Tumors                Odds Ratio*
                      Free >5 Yr    <5 Yr       (95% CI)
NPI (p=0.06)                                    6.08 (1.58-23.39)
  Low (<3.4)          27            8
  High (>=3.4)        5             9
NPI-ES (p<0.001)                                10.27 (2.40-43.94)
  Low                 22            3
  High                10            14

*Odds ratios were calculated using a standard two-by-two table. CI stands for "confidence interval".
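
The odds ratios and intervals in Table 1b can be reproduced from the four cell counts. The footnote does not say how the confidence intervals were derived; the sketch below assumes the Woolf (log odds ratio) method, which yields values matching those tabulated. The function name is hypothetical.

```python
import math


def odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio and Woolf (log-based) 95% CI for the 2x2 table [[a, b], [c, d]]."""
    odds_ratio = (a * d) / (b * c)
    se_log = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log)
    upper = math.exp(math.log(odds_ratio) + z * se_log)
    return odds_ratio, lower, upper


# NPI-ES rows of Table 1b: Low = (22 free, 3 within 5 yr), High = (10, 14).
or_es, lo_es, hi_es = odds_ratio_ci(22, 3, 10, 14)
# Classical NPI rows: Low = (27, 8), High = (5, 9).
or_npi, lo_npi, hi_npi = odds_ratio_ci(27, 8, 5, 9)
```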



Table 1c) Odds ratio for relapse within five years as a first event in Stanford ER+ Tumors
based upon PES expression and NPI-ES expression. One sample did not possess relapse
information and was removed from analysis (leaving 45 ER+ tumors).

                      ER+ Tumors                Odds Ratio
                      Free          Relapse     (95% CI)
PES (p=0.053)                                   3.90 (0.94-16.25)
  Low                 26            8
  High                5             6
NPI-ES (p=0.040)                                4.17 (1.05-16.48)
  Low                 25            7
  High                6             7
Table S1. Histopathology of Breast Tumors*

Sample    Age  Size (mm)  Grade  Node      NPI   ER   PR   Subtype          LVI       DCIS

ER+
2000220   52   60         3      30 of 34  7.2   pos  neg  ductal           yes       minimal
980278    64   40         3      14 of 20  6.8   pos  neg  ductal/micropap  yes       minimal
2000597   57   40         2      0 of 12   3.8   pos  neg  ductal           possible  extensive
2000609   62   70         2      17 of 17  6.4   pos  pos  ductal           yes       none
20020071  58   28         3      0 of 16   4.56  pos  pos  ductal           no        none
20020160  86   120        3      0 of 10   6.4   pos  pos  lobular          no        none
2000787   57   60         3      0 of 9    5.2   pos  pos  ductal           yes       none
2000818   52   10         2      0 of 11   3.2   pos  neg  ductal           no        minimal
20020051  38   50         3      1 of 25   6     pos  pos  ductal           no        none
20020056  71   20         1      2 of 17   3.4   pos  neg  ductal           no        minimal
980197    55   30         3      2 of 4    5.6   pos  pos  ductal           yes       minimal
980261    60   15         2      0 of 9    3.3   pos  neg  ductal           no        minimal
980391    56   20         2      0 of 7    3.4   pos  pos  ductal           no        none
2000768   39   40         3      0 of 17   4.8   pos  pos  ductal           no        minimal
2000779   48   55         3      0 of 14   5.1   pos  neg  ductal           no        none
990123    54   55         3      7 of 11   7.1   pos  pos  ductal           no        none
2000422   51   63         3      3 of 7    6.26  pos  pos  ductal           no        minimal
2000683   72   35         2      0 of 17   3.7   pos  pos  ductal           no        minimal
2000775   51   25         2      0 of 12   3.5   pos  neg  ductal           no        minimal
2000804   39   40         3      5 of 21   6.8   pos  pos  ductal           yes       minimal
980346    52   20         3      0 of 4    4.4   pos  pos  ductal           possible  minimal
980383    64   30         2      0 of 16   3.6   pos  pos  ductal           no        minimal
990082    49   34         2      3 of 16   4.68  pos  pos  ductal           no        minimal
980177    75   26         2      6 of 13   5.52  pos  pos  ductal           yes       none
980178    69   32         3      2 of 15   5.74  pos  neg  ductal           no        minimal
980403    73   30         3      0 of 9    4.6   pos  pos  ductal           possible  minimal
980434    73   30         3      0 of 16   4.6   pos  pos  ductal           no        minimal
990075    66   25         3      5 of 21   6.5   pos  pos  ductal           yes       none
990113    70   90         3      11 of 15  7.8   pos  pos  ductal           no        minimal
990107    50   40         1      1 of 18   3.8   pos  neg  tub-mixed        yes       minimal
980208    42   25         3      5 of 20   6.5   pos  pos  ductal           no        none
980220    40   37         2      0 of 5    3.74  pos  pos  ductal           yes       minimal
980221    33   65         3      1 of 13   6.3   pos  pos  ductal           no        none
990375    38   15         1      0 of 10   2.3   pos  neg  ductal           no        extensive

ER-
980193    49   25         3      3 of 23   5.5   neg  neg  ductal           no        minimal
980216    65   45         2      5 of 20   5.9   neg  neg  ductal           no        none
980256    46   36         3      1 of 12   5.72  neg  neg  ductal           no        none
980285    49   40         3      1 of 7    5.8   neg  neg  ductal           yes       minimal
980338    55   30         3      0 of 7    4.6   neg  neg  ductal           no        none
980353    58   45         3      0 of 25   4.9   neg  neg  metaplastic      no        none
980411    69   30         2      0 of 9    3.6   neg  neg  ductal           no        none
980441    66   30         3      4 of 14   6.6   neg  neg  ductal           yes       none
990174    55   45         2      3 of 24   5.9   neg  neg  ductal           yes       minimal
2000320   67   20         3      20 of 21  6.4   neg  neg  ductal           yes       none
2000500   44   75         3      6 of 6    7.5   neg  neg  ductal           yes       none
980247    35   45         3      1 of 19   5.9   neg  neg  ductal           yes       minimal
990299    58   55         3      7 of 17   7.1   neg  neg  ductal           possible  minimal
2000593   60   41         3      0 of 15   4.82  neg  neg  ductal           no        none
2000638   60   40         1      0 of 15   2.8   pos  neg  lobular          no        none
2000731   68   51         3      1 of 29   6.02  pos  neg  ductal           no        minimal
2000880   55   15         2      0 of 26   3.3   neg  neg  ductal           no        none

ERBB2+
QR0194    58   50         3      25 of 32  7     neg  neg  ductal           yes       none
980214    49   60         2      5 of 13   6.2   pos  neg  ductal           no        extensive
980238    62   20         3      7 of 21   6.4   neg  neg  ductal           no        extensive
980288    45   60         3      13 of 15  7.2   pos  neg  ductal           yes       extensive
          33   3          3      3 of 7    5.06  neg  neg  ductal           yes       extensive
J DUO/ O  77   30         3      0 of 14   4.6   neg  neg  ductal           no        minimal
aOUOOU    56                     0 of 6          neg  neg
          68   30         3      1 of 10   5.6   neg  neg  ductal           yes       none
          66   35         3      10 of 12  6.7   neg  neg  ductal           yes       extensive
990115    38   28         3      9 of 10   6.56  pos  pos  ductal           yes       extensive
990134    43   40         3      0 of 19   4.8   neg  neg  ductal           no        none
990148    60   40         2      6 of 19   5.8   pos  neg  ductal           yes       minimal
990223    52   5          3      1 of 21   5.1   pos  neg  ductal           no        extensive
2000104   59                     ---             pos  neg  ductal
2000171   50   25         2      0 of 9    3.5   neg  neg  ductal           no        none
2000209   58   50         3      0 of 7    5     pos  neg  ductal           no        none
2000210   50   40         3      3 of 6    5.8   neg  neg  ductal           yes       none
2000237   43   47         3      23 of 40  6.94  pos  pos  ductal           yes       minimal
2000287   53   40         3      0 of 8    4.8   neg  neg  ductal           possible  none
2000399   44   40         2      0 of 8    3.8   neg  neg  ductal           no        minimal
2000641   47   60         3      16 of 24  5.2   neg  neg  ductal           yes       minimal
2000652   56   25         3      6 of 21   6.5   neg  neg  ductal           no        minimal
2000675   78   55         3      16 of 16  7.1   neg  neg  ductal           yes       minimal
2000709   45   30         3      0 of 16   4.6   neg  neg  ductal           no        none
2000759   57   7          3      0 of 12   4.14  neg  neg  ductal           no        extensive
2000813   60   23         3      16 of 17  6.46  neg  neg  ductal           yes       extensive
2000829   51   45         2      10 of 10  5.9   neg  neg  ductal           yes       extensive
20020090  60   45         3      19 of 27  6.9   neg  neg  ductal           yes       minimal



* This list contains clinical information for 79 out of 98 tumors used in this study.
Clinical information for the remaining 19 tumors was incomplete and is not included in the
list. Only the 79 samples with complete clinical information were used for subsequent
NPI-ES analysis.
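
The NPI column of Table S1 combines the size, grade and node columns. The sketch below assumes the standard NPI formula (0.2 x tumour size in cm + grade + lymph node stage, with stage 1 for no positive nodes, 2 for one to three, and 3 for four or more); the formula is not restated in this section, and it reproduces most, though not all, of the tabulated values. Function names are hypothetical.

```python
def node_stage(positive_nodes):
    """Lymph node stage used by the NPI: 1 (none), 2 (1-3 nodes), 3 (4 or more)."""
    if positive_nodes == 0:
        return 1
    if positive_nodes <= 3:
        return 2
    return 3


def nottingham_prognostic_index(size_mm, grade, positive_nodes):
    """NPI = 0.2 x tumour size in cm + histological grade + node stage."""
    return 0.2 * (size_mm / 10.0) + grade + node_stage(positive_nodes)


# Tumour 2000220 in Table S1: 60 mm, grade 3, 30 positive nodes (listed NPI 7.2).
npi_a = nottingham_prognostic_index(60, 3, 30)
# Tumour 2000818 in Table S1: 10 mm, grade 2, no positive nodes (listed NPI 3.2).
npi_b = nottingham_prognostic_index(10, 2, 0)
```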
Table S3, the NPI-ES delivered a classification accuracy of 80%, compared to the 44
gene set which delivered a 70% classification accuracy.



Table S3: Classification accuracy of the NPI-ES or 44 gene set on 78 Rosetta
Tumors

              NPI classification (<3.4 or >3.4)
              No. of misclassifications (Accuracy)
44 Genes      23 (70%)
NPI-ES        15 (80%)
Table S5: List of Top 50 Significantly Regulated Genes in ER+, ER-
and ERBB2+ Molecular Subtypes

This list represents the top 50 genes identified by SAM to be significantly regulated in 
each molecular subtype (ER+, ER-, ERBB2+). The genes are ranked by their S2N 
correlation ratio, which reflects the extent of the expression perturbation observed among 
different groups. There is good overlap between these genes and similar lists reported by 
other studies (ref. 8-11) (main text). 



Unigene    Chromosome           Gene description

ER+ Molecular Subtype
Hs.1657    Chr:6q25.1           estrogen receptor 1
Hs.169946  Chr:10p15            GATA binding protein 3
Hs.279928  Chr:1q21             annexin A9
Hs.90419   Chr:4q31.1           KIAA0882 protein
Hs.5338    Chr:15q22            carbonic anhydrase XII
Hs.1360    Chr:19q13.2          cytochrome P450, subfamily IIB (phenobarbital-inducible), polypeptide 6
Hs.406050  Chr:1p35.1           dynein, axonemal, light intermediate polypeptide 1
Hs.82222   Chr:3p21.3           sema domain, immunoglobulin domain (Ig), short basic domain, secreted, (semaphorin) 3B
Hs.155956  Chr:8p23.1-p21.3     N-acetyltransferase 1 (arylamine N-acetyltransferase)
Hs.76353   Chr:14q32.1          serine (or cysteine) proteinase inhibitor, clade A (alpha-1 antiproteinase, antitrypsin), member 5
Hs.351875  Chr:8q22-q23         cytochrome c oxidase subunit VIc
Hs.71968   -                    Homo sapiens mRNA; cDNA DKFZp564F053 (from clone DKFZp564F053), mRNA sequence
Hs.79136   Chr:18q12.1          LIV-1 protein, estrogen regulated
Hs.73980   Chr:19q13.4          troponin T1, skeletal, slow
Hs.279916  Chr:15q21.3          hypothetical protein FLJ20151
Hs.12079   Chr:3q23-q24         calsyntenin 2
Hs.79241   Chr:18q21.3          B-cell CLL/lymphoma 2
Hs.81131   Chr:19p13.3          guanidinoacetate N-methyltransferase
Hs.101174  Chr:17q21.1          microtubule-associated protein tau
Hs.15929   Chr:6q25.1           hypothetical protein FLJ12910
Hs.355977  Chr:8q21             WW domain-containing protein 1
Hs.432605  Chr:9q31             UDP-glucose ceramide glucosyltransferase
Hs.193914  Chr:2p25.1           GREB1 protein
Hs.241471  Chr:14q32.32         RNB6
Hs.405998  -                    Human insulin-like growth factor 1 receptor mRNA, 3' sequence, mRNA sequence
Hs.82065   Chr:5q11             interleukin 6 signal transducer (gp130, oncostatin M receptor)
Hs.285976  Chr:1q21.2           LAG1 longevity assurance homolog 2 (S. cerevisiae)
Hs.57652   Chr:1p21             cadherin, EGF LAG seven-pass G-type receptor 2 (flamingo homolog, Drosophila)
Hs.170414  Chr:15q26            paired basic amino acid cleaving system 4
Hs.65756   Chr:16p13.3          regulator of G-protein signalling 11
Hs.432605  Chr:9q31             UDP-glucose ceramide glucosyltransferase
Hs.283675  Chr:16p13.2          NPD009 protein
Hs.1334    Chr:6q22-q23         v-myb myeloblastosis viral oncogene homolog (avian)
Hs.82065   Chr:5q11             interleukin 6 signal transducer (gp130, oncostatin M receptor)
Hs.170290  Chr:10q23            discs, large (Drosophila) homolog 5
Hs.432587  -                    Homo sapiens mRNA; cDNA DKFZp434E082 (from clone DKFZp434E082), mRNA sequence
Hs.330780  Chr:19q13.2          cytochrome P450, subfamily IIB (phenobarbital-inducible), polypeptide 7
Hs.16059   Chr:17q21            HSPC009 protein
Hs.4084    Chr:12q24.22         KIAA1025 protein
Hs.82911   Chr:1p35             protein tyrosine phosphatase type IVA, member 2
Hs.238126  Chr:1q44             CGI-49 protein
Hs.256086  Chr:20q13.11         chromosome 20 open reading frame 35
Hs.96      Chr:18q21.31         phorbol-12-myristate-13-acetate-induced protein 1
Hs.301011  Chr:19p13.3          KIAA0876 protein
Hs.82273   Chr:5p15.1           hypothetical protein FLJ20152
Hs.22753   Chr:5q35.3           hypothetical protein FLJ22318
Hs.350470  Chr:21q22.3          trefoil factor 1 (breast cancer, estrogen-inducible sequence expressed in)
Hs.82520   Chr:11q13            polymerase (DNA-directed), delta 4
Hs.348198  Chr:3p21.31          putative proline 4-hydroxylase
Hs.105445  Chr:10q26            GDNF family receptor alpha 1

ERBB2+ Molecular Subtype
Hs.241551  Chr:1p31-p22         chloride channel, calcium activated, family member 2
Hs.323910  Chr:17q11.2-q12      v-erb-b2 erythroblastic leukemia viral oncogene homolog 2, neuro/glioblastoma derived oncogene homolog (avian)
Hs.86859   Chr:17q21.1          growth factor receptor-bound protein 7
Hs.180383  Chr:12q22-q23        dual specificity phosphatase 6
Hs.77628   Chr:17q11-q12        START domain containing 3
Hs.302740  Chr:7q33-q34         transient receptor potential cation channel, subfamily V, member 6
Hs.100000  Chr:1q21             S100 calcium binding protein A8 (calgranulin A)
Hs.76780   Chr:12q13.13         protein phosphatase 1, regulatory (inhibitor) subunit 1A
Hs.165950  Chr:5q35.1-qter      fibroblast growth factor receptor 4
Hs.32964   Chr:2p25             SRY (sex determining region Y)-box 11
Hs.106642  -                    Unknown protein [Homo sapiens], mRNA sequence
Hs.28935   Chr:9q21.32          transducin-like enhancer of split 1 (E(sp1) homolog, Drosophila)
Hs.91668   Chr:17q21.1          hypothetical gene MGC9753
Hs.151988  Chr:6q22.33          mitogen-activated protein kinase kinase kinase 5
Hs.202949  Chr:4p13             KIAA1102 protein
Hs.249163  Chr:16q23            fatty acid hydroxylase
Hs.33102   Chr:6p12             transcription factor AP-2 beta (activating enhancer binding protein 2 beta)
Hs.112405  Chr:1q21             S100 calcium binding protein A9 (calgranulin B)
Hs.154890  Chr:4q34-q35         fatty-acid-Coenzyme A ligase, long-chain 2
Hs.193745  Chr:2q37.3           hypothetical protein FLJ22671
Hs.107318  Chr:1q42-q44         kynurenine 3-monooxygenase (kynurenine 3-hydroxylase)
Hs.21572   Chr:7p15.1           KIAA0644 gene product
Hs.283664  Chr:8q12.1           aspartate beta-hydroxylase
Hs.169919  Chr:15q23-q25        electron-transfer-flavoprotein, alpha polypeptide (glutaric aciduria II)
Hs.251754  Chr:20q12            secretory leukocyte protease inhibitor (antileukoproteinase)
Hs.11223   Chr:2q33.3           isocitrate dehydrogenase 1 (NADP+), soluble
Hs.1892    Chr:17q21-q22        phenylethanolamine N-methyltransferase
Hs.103395  Chr:1q42.11          hypothetical protein FLJ14146
Hs.169238  Chr:19p13.3          fucosyltransferase 3 (galactoside 3(4)-L-fucosyltransferase, Lewis blood group included)
Hs.32952   Chr:12q13            keratin, hair, basic, 1
Hs.173035  Chr:5p13.3           PDZ domain containing 2
Hs.160786  Chr:9q34.1           argininosuccinate synthetase
Hs.54431   Chr:6p12.3           specific granule protein (28 kDa)
Hs.306777  -                    Homo sapiens cDNA: FLJ21521 fis, clone COL05880, mRNA sequence
Hs.169139  Chr:2q22.1           kynureninase (L-kynurenine hydrolase)
Hs.118552  Chr:11q12.1          hypothetical protein FLJ20539
Hs.343874  Chr:22q11.21         proline dehydrogenase (oxidase) 1
Hs.25960   Chr:2p24.1           v-myc myelocytomatosis viral related oncogene, neuroblastoma derived (avian)
Hs.57664   Chr:2q24.2           integrin, beta 6
Hs.433404  Chr:7p15-p14         hypothetical protein MGC3077
Hs.80658   Chr:11q13            uncoupling protein 2 (mitochondrial, proton carrier)
Hs.61638   Chr:5p15.1-p14.3     myosin X
Hs.23881   Chr:12q12-q21        keratin 7
Hs.79876   Chr:Xp22.32          steroid sulfatase (microsomal), arylsulfatase C, isozyme S
Hs.95231   Chr:16q22            formin homology 2 domain containing 1
Hs.90786   Chr:17q22            ATP-binding cassette, sub-family C (CFTR/MRP), member 3
Hs.11260   Chr:8p21.3           chondroitin beta1,4 N-acetylgalactosaminyltransferase
Hs.89121   -                    KIAA0485 protein
Hs.301947  Chr:22q13            kraken-like
Hs.211933  Chr:10q22            collagen, type XIII, alpha 1

ER- Molecular Subtype
Hs.432448  Chr:17q12-q21        keratin 16 (focal non-epidermolytic palmoplantar keratoderma)
Hs.70725   Chr:5q33-q34         gamma-aminobutyric acid (GABA) A receptor, pi
Hs.9030    Chr:Xq26.3           TONDU
Hs.432677  Chr:12q12-q13        keratin 6B
Hs.55279   Chr:18q21.3          serine (or cysteine) proteinase inhibitor, clade B (ovalbumin), member 5
Hs.433845  Chr:12q12-q13        keratin 5 (epidermolysis bullosa simplex, Dowling-Meara/Kobner/Weber-Cockayne types)
Hs.44317   Chr:22q13.1          SRY (sex determining region Y)-box 10
Hs.279651  Chr:19q13.32-q13.33  melanoma inhibitory activity
Hs.2256    Chr:11q21-q22        matrix metalloproteinase 7 (matrilysin, uterine)
Hs.7306    Chr:8p12-p11.1       secreted frizzled-related protein 1
Hs.130881  Chr:2p15             B-cell CLL/lymphoma 11A (zinc finger protein)
Hs.284186  -                    Homo sapiens cDNA FLJ11796 fis, clone HEMBA1006158, highly similar to Homo sapiens transcription factor forkhead-like 7 (FKHL7) gene, mRNA sequence
Hs.162211  Chr:Xq23-q24         solute carrier family 6 (neurotransmitter transporter), member 14
Hs.10587   Chr:15q26.3          desmuslin
Hs.271977  Chr:2q13-q21         engrailed homolog 1
Hs.153179  Chr:11p15.5-p15.4    ribosomal protein, large P2
Hs.82237   Chr:11q22-q23        tripartite motif-containing 29
Hs.180142  Chr:10p15.1          calmodulin-like skin protein
Hs.239727  Chr:18q12.1          desmocollin 2
Hs.194093  Chr:3q21.1           ropporin, rhophilin associated protein
Hs.391270  Chr:11q22.3-q23.1    crystallin, alpha B
Hs.12372   Chr:4q31.23          tripartite motif-containing 2
Hs.77432   Chr:7p12             epidermal growth factor receptor (erythroblastic leukemia viral (v-erb-b) oncogene homolog, avian)
Hs.71331   Chr:1q21.2           leucine-rich acidic nuclear protein like
Hs.127007  Chr:6p21             potassium channel, subfamily K, member 5
Hs.50915   Chr:19q13.3-q13.4    kallikrein 5
Hs.8944    Chr:3q21-q24         procollagen C-endopeptidase enhancer 2
Hs.66762   -                    Hypothetical protein [Homo sapiens], mRNA sequence
Hs.3844    Chr:1p22.3           LIM domain only 4
Hs.2785    Chr:17q12-q21        keratin 17
Hs.1925    Chr:18q12.1-q12.2    desmoglein 3 (pemphigus vulgaris antigen)
Hs.367762  Chr:12q12-q13        keratin 6A
Hs.82527   Chr:12p12.1-p11.2    sialyltransferase 8A (alpha-N-acetylneuraminate: alpha-2,8-sialytransferase, GD3 synthase)
Hs.84728   Chr:13q21.32         Kruppel-like factor 5 (intestinal)
Hs.6066    Chr:2q22             Rho guanine nucleotide exchange factor (GEF) 4
Hs.79361   Chr:19q13.3          kallikrein 6 (neurosin, zyme)
Hs.196384  Chr:1q25.2-q25.3     prostaglandin-endoperoxide synthase 2 (prostaglandin G/H synthase and cyclooxygenase)
Hs.180479  Chr:20p12.3          chromosome 20 open reading frame 42
Hs.5422    Chr:Xp22.2           glycoprotein M6B
Hs.77573   Chr:7                uridine phosphorylase
Hs.18141   Chr:1q25.1-q32.3     ladinin 1
Hs.75825   Chr:6q24-q25         pleiomorphic adenoma gene-like 1
Hs.41690   Chr:18q12.1          desmocollin 3
Hs.349611  -                    Homo sapiens cDNA FLJ30869 fis, clone FEBRA2004224, mRNA sequence
Hs.36761   Chr:3q29             HRAS-like suppressor
Hs.10526   Chr:12q21.1          cysteine and glycine-rich protein 2
Hs.7122    Chr:4q31-q32         scrapie responsive protein 1
Hs.26468   Chr:15q11-q12        amyloid beta (A4) precursor protein-binding, family A, member 2 (X11-like)
Hs.105940  Chr:11q21            jerky homolog-like (mouse)
Hs.170009  Chr:2p13             transforming growth factor, alpha
Table S6: Genes Belonging to the NPI-ES (62 Genes)

DC13 protein is the only gene of the NPI-ES that can be matched in the Rosetta 70-gene
'prognosis' signature (PES, see main text), out of which 42 are present in the Affymetrix
U133A chip.



Positive genes (60) (Highly Expressed In High NPI Tumors)

Unigene    Gene Description
Hs.28914   adenine phosphoribosyltransferase
Hs.154443  MCM4 minichromosome maintenance deficient 4 (S. cerevisiae)
Hs.47504   exonuclease 1
Hs.367650  Metallothionein 1H-like protein [Homo sapiens], mRNA sequence
Hs.319215  Homo sapiens, clone IMAGE:5270727, mRNA, mRNA sequence
Hs.6879    DC13 protein
Hs.433180  HSPC037 protein
Hs.119192  H2A histone family, member Z
Hs.77695   discs, large homolog 7 (Drosophila)
Hs.381097  RNA helicase-related protein [Homo sapiens], mRNA sequence
Hs.8878    kinesin-like 1
Hs.9329    chromosome 20 open reading frame 1
Hs.155314  KIAA0095 gene product
Hs.203963  helicase, lymphoid-specific
Hs.37035   homeo box HB9
Hs.18212   DNA segment on chromosome X (unique) 9879 expressed sequence
Hs.79078   MAD2 mitotic arrest deficient-like 1 (yeast)
Hs.433317  eukaryotic translation initiation factor 4E binding protein 1
Hs.10029   cathepsin C
Hs.249216  H2B histone family, member J
Hs.180062  proteasome (prosome, macropain) subunit, beta type, 8 (large multifunctional protease 7)
Hs.89306   hypothetical protein FLJ20105
Hs.14559   chromosome 10 open reading frame 3
Hs.283532  uncharacterized bone marrow protein BM039
Hs.30114   likely ortholog of mouse gene rich cluster, C8 gene
Hs.334562  cell division cycle 2, G1 to S and G2 to M
Hs.118786  metallothionein 2A

Biological Process (GO)
9116 // nucleoside metabolism // extended:inferred from electronic annotation; Pribosyltran; 5e-44
6260 // DNA replication // predicted/computed
6310 // DNA recombination // experimental evidence /// 6281 // DNA repair // experimental evidence /// 6298 // mismatch repair // predicted/computed

7267 // cell-cell signaling // extended:Unknown; GKAP; 2.1e-05
7067 // mitosis // experimental evidence /// 7052 // mitotic spindle assembly // experimental evidence
7067 // mitosis // predicted/computed /// 8283 // cell proliferation // predicted/computed

6959 // humoral immune response // experimental evidence /// 6357 // regulation of transcription from Pol II promoter // predicted/computed /// 7345 // embryogenesis and morphogenesis // experimental evidence
7067 // mitosis // predicted/computed /// 7093 // mitotic checkpoint // experimental evidence
6445 // regulation of translation // predicted/computed
6508 // proteolysis and peptidolysis // not recorded /// 6955 // immune response // experimental evidence
6508 // proteolysis and peptidolysis // not recorded

74 // regulation of cell cycle // not recorded /// 7089 // start control point of mitotic cell cycle // not recorded
6878 // copper homeostasis // predicted/computed

Unigene    Gene Description
Hs.234896  geminin, DNA replication inhibitor
Hs.54481   low density lipoprotein receptor-related protein 8, apolipoprotein e receptor
Hs.109706  hematological and neurological expressed 1
Hs.7644    H1 histone family, member 2
Hs.388     nudix (nucleoside diphosphate linked moiety X)-type motif 1
Hs.374950  metallothionein 1X
Hs.247817  H2B histone family, member T
Hs.38972   tetraspan 1
Hs.2667    metallothionein 1H
Hs.70937   H3 histone family, member K
Hs.75319   ribonucleotide reductase M2 polypeptide
Hs.1578    baculoviral IAP repeat-containing 5 (survivin)
Hs.272027  F-box only protein 5
Hs.297681  serine (or cysteine) proteinase inhibitor, clade A (alpha-1 antiproteinase, antitrypsin), member 1
Hs.296398  lysosomal associated protein transmembrane 4 beta
Hs.80420   chemokine (C-X3-C motif) ligand 1
Hs.112058  CD27-binding (Siva) protein
Hs.278338  LGN protein
Hs.18686   Mouse Mammary Turmor Virus Receptor homolog 1
Hs.239     forkhead box M1
Hs.316752  met proto-oncogene (hepatocyte growth factor receptor)
Hs.87497   butyrophilin, subfamily 3, member A2
Hs.26481   S6BI26 protein
Hs.123253  likely ortholog of mouse Shc SH2-domain binding protein 1
Hs.143042  H3 histone family, member B
Hs.82961   trefoil factor 3 (intestinal)
Hs.405944  immunoglobulin lambda locus
Hs.122908  DNA replication factor
Hs.301663  Homo sapiens cDNA FLJ30781 fis, clone FEBRA2000874, mRNA sequence

Biological Process (GO)
7050 // cell cycle arrest // predicted/computed /// 8156 // negative regulation of DNA replication // predicted/computed
7165 // signal transduction // predicted/computed /// 6629 // lipid metabolism // predicted/computed

6979 // response to oxidative stress // predicted/computed /// 6281 // DNA repair // not recorded

8283 // cell proliferation // not recorded /// 8583 // mystery cell fate differentiation (sensu Drosophila) // predicted/computed /// 7155 // cell adhesion // not recorded /// 6928 // cell motility // not recorded

86 // G2/M transition of mitotic cell cycle // experimental evidence /// 7048 // oncogenesis // predicted/computed /// 6916 // anti-apoptosis // experimental evidence
6508 // proteolysis and peptidolysis // predicted/computed

7165 // signal transduction // experimental evidence /// 6954 // inflammatory response // not recorded /// 6935 // chemotaxis // experimental evidence /// 6955 // immune response // not recorded /// 7155 // cell adhesion // experimental evidence /// 7267 // cell-cell signaling // experimental evidence
8624 // induction of apoptosis by extracellular signals // predicted/computed /// 6952 // defense response // predicted/computed
7186 // G-protein coupled receptor protein signaling pathway // predicted/computed

6366 // transcription from Pol II promoter // experimental evidence /// 6979 // response to oxidative stress // experimental evidence
7048 // oncogenesis // experimental evidence /// 8283 // cell proliferation // predicted/computed /// 7165 // signal transduction // predicted/computed

6952 // defense response // predicted/computed /// 7586 // digestion // predicted/computed

Unigene    Gene Description
Hs.16530   chemokine (C-C motif) ligand 18 (pulmonary and activation-regulated)
Hs.406565  immunoglobulin kappa constant
Hs.79058   suppressor of Ty 4 homolog 1 (S. cerevisiae)
Hs.137476  paternally expressed 10

Biological Process (GO)
7165 // signal transduction // experimental evidence /// 7154 // cell communication // predicted/computed /// 6935 // chemotaxis // experimental evidence /// 6955 // immune response // predicted/computed /// 6960 // antimicrobial humoral response (sensu Invertebrata) // predicted/computed /// 9607 // response to biotic stimulus // predicted/computed /// 7267 // cell-cell signaling // experimental evidence
6355 // regulation of transcription, DNA-dependent // predicted/computed /// 6357 // regulation of transcription from Pol II promoter // predicted/computed /// 6338 // chromatin modeling // predicted/computed

Negative genes (2) (Highly Expressed In Low NPI Tumors)

Unigene    Gene Description
Hs.75462   BTG family, member 2
Hs.268554  cytochrome P450, subfamily IVF, polypeptide 8

Biological Process (GO)
8285 // negative regulation of cell proliferation // predicted/computed /// 6281 // DNA repair // predicted/computed /// 6976 // DNA damage response, activation of p53 // predicted/computed
6118 // electron transport // extended:Unknown; p450; 1.9e-142 /// 6693 // prostaglandin metabolism // predicted/computed
Table S7. SAM was performed to identify 68 genes significantly associated with grade
(FDR of 14%, >=2-fold change). 45 of these genes (66%) also belong to the NPI
classifier, labeled as "YES" in the NPI-ES column.



Gene Name NPI-ES 
Genes upregulated in Grade 3 tumors 

RAD51 -Interacting protein 

DC13 protein YES 

HSPC037 protein YES 

homeo box HB9 YES 
cyclin B2 

protein regulator of cytokinesis 1 

likely ortholog of mouse gene rich cluster, C8 gene YES 

klnesin-like 1 YES 

H2A histone family, member Z YES 

DNA replication factor 1 " 

MCM4 minichromosome maintenance deficient 4 (S. cerevisiae) YES

discs, large homolog 7 (Drosophila) YES 
ZW10 interactor 

MAD2 mitotic arrest deficient-like 1 (yeast) YES 

Metallothionein 1H-like protein [Homo sapiens], mRNA sequence YES

chromosome 10 open reading frame 3 YES 

ribonucleotide reductase M2 polypeptide YES 

cell division cycle 2, G1 to S and G2 to M YES 

forkhead box M1 YES

ycc 

uncharacterized bone marrow protein BM039 co 

YES 

helicase, lymphoid-specific

RNA helicase-related protein [Homo sapiens], mRNA sequence YES
metallothionein 1X 

Homo sapiens, clone IMAGE:5270727, mRNA, mRNA sequence
metallothionein 2A 
metallothionein 1H 
KIAA0095 gene product 
baculoviral IAP repeat-containing 5 (survivin) 
geminin, DNA replication inhibitor 
enhancer of zeste homolog 2 (Drosophila) 
cathepsin C 

nudix (nucleoside diphosphate linked moiety X)-type motif 1 
hypothetical protein FLJ10719 
chemokine (C-X3-C motif) ligand 1 
tetraspan 1 

proapoptotic caspase adaptor protein 
immunoglobulin lambda locus 
H2B histone family, member J 
trefoil factor 3 (intestinal)
CD27-binding (Siva) protein 
topoisomerase (DNA) II alpha 170kDa 



YES 
YES 
YES 
YES 
YES 
YES 
YES 

YES 
YES 

YES 
YES 

YES 
YES 
YES 
YES 
immunoglobulin lambda joining 3

eukaryotic translation initiation factor 4E binding protein 1

H3 histone family, member K 

chemokine (C-C motif) ligand 18 (pulmonary and activation-regulated) 
lysosomal associated protein transmembrane 4 beta 
Mouse Mammary Tumor Virus Receptor homolog 1
LGN protein 

immunoglobulin kappa constant
carboxypeptidase B1 (tissue)

met proto-oncogene (hepatocyte growth factor receptor) 

H2B histone family, member T 

RAB38, member RAS oncogene family 

H1 histone family, member 2 

hypothetical protein from EUROIMAGE 2021883

apolipoprotein B mRNA editing enzyme, catalytic polypeptide-like 3B 

H3 histone family, member B 

immunoglobulin heavy constant gamma 3 (G3m marker) 

similar to bK246H3.1 (immunoglobulin lambda-like polypeptide 1, pre-B-cell specific)

Immunoglobulin lambda light chain [Homo sapiens], mRNA sequence
Immunoglobulin kappa light chain variable region [Homo sapiens], mRNA sequence

serine (or cysteine) proteinase inhibitor, clade A (alpha-1 antiproteinase,
antitrypsin), member 1

proteolipid protein 1 (Pelizaeus-Merzbacher disease, spastic paraplegia 2, 
uncomplicated) 

sodium channel, nonvoltage-gated 1, beta (Liddle syndrome) 

H4 histone family, member H 

syndecan 2 (heparan sulfate proteoglycan 1, cell surface-associated, fibroglycan)

neuropilin (NRP) and tolloid (TLL)-like 2 

Genes downregulated in Grade 3 tumors 

hypothetical protein FLJ22418
               Luminal A    Luminal C
Low NPI-ES        30            0
High NPI-ES        2           10

Table S11: Correlation of Luminal A and Luminal C Tumors with High and Low NPI-ES
Expression (Luminal tumors were identified based upon the results of Sorlie et al.
(2001))
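The strength of the association in Table S11 can be illustrated with a two-sided Fisher exact test on the four counts. The source does not state which statistical test, if any, was applied to this table, so the following is only a minimal sketch under that assumption; the function name `fisher_exact_two_sided` is hypothetical and not part of the original disclosure.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    n = row1 + row2
    denom = comb(n, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo = max(0, col1 - row2)
    hi = min(row1, col1)
    # Sum the probabilities of all tables no more likely than the observed one.
    return sum(
        p for p in (prob(x) for x in range(lo, hi + 1))
        if p <= p_obs * (1 + 1e-9)
    )


# Counts from Table S11: rows = Low / High NPI-ES, columns = Luminal A / Luminal C.
p_value = fisher_exact_two_sided(30, 0, 2, 10)
print(f"Two-sided Fisher exact p-value: {p_value:.3g}")
```

With the counts above, the observed table is the most extreme one compatible with the margins, so the p-value reduces to a single hypergeometric term and is very small.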



Table S12: The number of deaths (after 5 years) was then tabulated as follows:

          H->H    H->L           L->L           L->H
Total     6       [illegible]    [illegible]    N/A
Death     4       0              3              N/A
AWD*      1       0              2              N/A

*AWD: alive with disease
Table L1: Lookup table of IDs for prognostic set genes (NPI-ES)

Probe_ID      GenBank       UniGene

200853_at     NM_002106.1   Hs.119192
201483_s_at   BC002802.1    Hs.79058
201487_at     NM_001814.1   Hs.10029
201890_at     NM_001034.1   Hs.75319
202095_s_at   NM_001168.1   Hs.1578
202188_at     NM_014669.1   Hs.155314
202580_x_at   NM_021953.1   Hs.239
202833_s_at   NM_000295.1   Hs.297681
203362_s_at   NM_002358.2   Hs.79078
203510_at     BG170541      Hs.316752
203687_at     NM_002996.1   Hs.80420
203764_at     NM_014750.1   Hs.77695
204444_at     NM_004523.2   Hs.8878
204603_at     NM_003686.1   Hs.47504
204623_at     NM_003226.1   Hs.82961
204766_s_at   NM_002452.1   Hs.388
205240_at     NM_013296.1   Hs.278338
206110_at     NM_003536.1   Hs.70937
206461_x_at   NM_005951.1   Hs.2667
208433_s_at   NM_017522.1   Hs.54481
208546_x_at   NM_003524.1   Hs.249216
208581_x_at   NM_005952.1   Hs.374950
208767_s_at   AW149681      Hs.296398
209040_s_at   U17496.1      Hs.180062
209114_at     AF133425.1    Hs.38972
209398_at     BC002649.1    Hs.7644
209806_at     BC000893.1    Hs.247817
209832_s_at   AF321125.1    Hs.122908
209924_at     AB000221.1    Hs.16530
210052_s_at   AF098158.1    Hs.9329
210559_s_at   D88357.1      Hs.334562
210792_x_at   AF033111.1    Hs.112058
211456_x_at   AF333388.1    Hs.367850
212094_at     BE858180      Hs.137476
212141_at     X74794.1      Hs.154443
212185_x_at   NM_005953.1   Hs.118786
212484_at     BF974389      Hs.18686
212613_at     AI991252      Hs.87497
213245_at     AL120173      Hs.301663
213892_s_at   AA927724      Hs.28914
214472_at     NM_003530.1   Hs.143042
214614_at     AI738662      Hs.37035
214768_x_at   BG540628      Hs.406565
215214_at     H53689        Hs.405944
217165_x_at   M10943        Hs.381097
217755_at     NM_016185.1   Hs.109706
218350_s_at   NM_015895.1   Hs.234896
WO 2005/033699 PCT/GB2004/004195 



218447_at     NM_020188.1   Hs.6879
218542_at     NM_018131.1   Hs.14559
218875_s_at   NM_012177.1   Hs.272027
219061_s_at   NM_006014.1   Hs.18212
219493_at     NM_024745.1   Hs.123253
219555_s_at   NM_018455.1   Hs.283532
219650_at     NM_017669.1   Hs.89306
220085_at     NM_018063.1   Hs.203963
220238_s_at   NM_018846.1   Hs.26481
221436_s_at   NM_031299.1   Hs.30114
221521_s_at   BC003186.1    Hs.433180
221539_at     AB044548.1    Hs.433317
222037_at     AI859865      Hs.319215
201236_s_at   NM_006763.1   Hs.75462
210576_at     AF133298.1    Hs.268554
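
A lookup table of this kind is typically used to translate between probe set identifiers and sequence or cluster identifiers. The following is a minimal sketch of how such rows might be loaded and queried; the helper name `parse_lookup` and the dictionary layout are assumptions made for illustration, and only three rows from the table are included.

```python
# Three rows copied from the lookup table: probe ID, GenBank accession, UniGene ID.
ROWS = """
201483_s_at BC002802.1 Hs.79058
201890_at NM_001034.1 Hs.75319
210576_at AF133298.1 Hs.268554
"""


def parse_lookup(text):
    """Parse whitespace-separated rows into a dict keyed by probe ID (hypothetical helper)."""
    table = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        probe_id, genbank, unigene = parts
        table[probe_id] = {"genbank": genbank, "unigene": unigene}
    return table


lookup = parse_lookup(ROWS)

# Reverse index for going from a UniGene cluster back to its probe set.
unigene_to_probe = {entry["unigene"]: probe for probe, entry in lookup.items()}

print(lookup["201483_s_at"]["unigene"])
print(unigene_to_probe["Hs.268554"])
```

The reverse index assumes each UniGene ID appears once, which holds for the rows shown here but would need checking before applying the same approach to a larger annotation file.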